Chapter 3: Bisonc++ concepts

This chapter introduces many of the basic concepts without which the details of bisonc++ do not make sense. If you do not already know how to use bisonc++, bison++ or bison, it is advised to start by reading this chapter carefully.

3.1: Languages and Context-Free Grammars

Bisonc++ can generate parsers for languages that are context-free. This means that you define one or more constructs, commonly called nonterminal symbols, and define rules by which they are recognized (or defined).

For example, C++ defines an `expression' nonterminal. One rule defining an expression might be,

"An expression consists of a minus sign and another expression".
Another rule could be
"An expression can be an integer".
Rules often are recursive, but there must be at least one rule ending the recursion.

A commonly used formal system for describing such rules is the Backus-Naur Form or `BNF', which was developed for specifying the language Algol 60. Any grammar expressed in BNF is a context-free grammar. The input to bisonc++ is essentially machine-readable BNF.

Not all context-free languages can be handled by bisonc++, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse input while `looking ahead' to just a single token. Strictly speaking, that is a description of an LR(1) grammar, and LALR(1) involves some additional restrictions which are not further elaborated in the current context; but it in real life almost all LR(1) grammars are also LALR(1) grammars. See section 7.5 for more information on this.

In the formal rules defining a language's grammar, the grammar's constructs are called nonterminals. Some symbols are not further defined, but are directly made available when matching the input to a series of regular expressions. Matching the input to regular expressions is not the task of a parser, but of a lexical scanner (e.g., flexc++(1)). Lexical scanners provide the parser with symbols, called tokens or terminal symbols, which are taken for granted by the parser.

Terminal and nonterminal symbols are used to define the rules of grammars.

We can use examples from C++ to illustrate terminal and nonterminal symbols. C++'s terminal symbols are identifiers, constants (numeric and string), various keywords, arithmetic operators, and other simple character tokens. So the terminal symbols of a grammar for C++ can thus be `IDENTIFIER', `NUMBER', `STRING', as wel as tokens for keywords (e.g., FOR, SWITCH, operators ('+', BIT_OR), and character tokens like ';', ':', '(', ')' ).

Here is a simple C++ function subdivided into tokens:

 
    int square(int x)   // INT IDENTIFIER '(' INT IDENTIFIER ')'
    {                   // '{'
        return x * x;   // RETURN IDENTIFIER '*' IDENTIFIER ';'
    }                   // '}'
        

C++ rules include expression rules, statement rules, rules defining declarations, and rules defining functions. These are defined in C++'s grammar as nonterminal symbols `expression', `statement', `declaration' and `function_definition'. The above example shows a function definition as well as the sequence of tokens (starting from INT, ending at '}'), received by the parser when that definition shows up in its input.

Each nonterminal symbol must be defined by at least one so-called production rule, defining the sequence of symbols that must have been observed before it can be decided that the nonterminal symbol has been encountered. For example, one kind of C++ statement is the return statement; it could be described by the following informal rule:

A `statement' can be made of a `return' keyword, an `expression' and a `semicolon'.

Formally, such a rule is written like this:


    statement:
        RETURN expression ';'
    ;
    
and in general grammar rules have the following form:
 
    nonterminal: 
        production(s)
    ;
        
where nonterminal is the nonterminal symbol defined by the rule and productions(s) are one or more sequences of terminal and nonterminal symbols that define how nonterminal can be recognized. Concentrating on a single production rule, the rule's nonterminal is called the rule's left-hand side (LHS), the production rule itself is called the rule's right-hand side (RHS). In the statement example the LHS equals statement, the RHS equals RETURN expression ';'. The RHS consists of three elements, numbered, respectively, 1 through 3.

One nonterminal symbol is special in that it defines a syntactically correct specification of the defined language. It is called the start symbol. In a compiler this means a complete input program. In C++, a nonterminal symbol `optional-sequence-of-definitions-and-declarations' could be used for this.

The parser generated by bisonc++ reads a sequence of tokens, and combines the tokens using the rules of the specified grammar. If the input is valid, the end result is that the entire token sequence reduces to the start symbol nonterminal. If we use C++'s grammar, then the entire input must reducible to `optional-sequence-of-definitions-and-declarations'. If not, the parser reports a syntax error.

3.2: From Formal Rules to Bisonc++ Input

A formal grammar is a mathematical construct. To define the language for Bisonc++, you must write a file expressing the grammar in bisonc++ syntax: a Bisonc++ grammar file. See chapter 4.

A nonterminal symbol in the formal grammar is represented in bisonc++ input as an identifier, like an identifier in C++. By convention, it should be in lower case, such as expr, stmt or declaration.

The bisonc++ representation for a terminal symbol is also called a token type. Token types as well can be represented as C++-like identifiers. By convention, these identifiers should be upper case to distinguish them from nonterminals: for example, INTEGER, IDENTIFIER, IF or RETURN. A terminal symbol that stands for a particular keyword in the language should be named after that keyword converted to upper case. The terminal symbol error is reserved for error recovery. See section 4.2, which also describes the current restrictions on the names of terminal symbols.

A terminal symbol can also be represented as a character literal, just like a C++ character constant. You should do this whenever a token is just a single character (parenthesis, plus-sign, etc.): use that same character in a literal as the terminal symbol for that token.

The grammar rules also have an expression in bisonc++ syntax. For example, here is the bisonc++ rule for a C++ return statement. The semicolon in quotes is a literal character token, representing part of the C++ syntax for the statement; the naked semicolon, and the colon, are bisonc++ punctuation used in every rule.


    stmt:   
        RETURN expr ';'
    ;
        
See section 4.3.

3.3: Semantic Values

A formal grammar selects tokens only by their classifications: for example, if a rule mentions the terminal symbol `integer constant', it means that any integer constant is grammatically valid in that position. The precise value of the constant is irrelevant when input is being parsed: if `x + 4' is grammatically correct then `x + 1' or `x + 3989' is also grammatically correct.

But actual values are very important for understanding the input's meaning. A compiler is useless if it fails to distinguish between 4, 1 and 3989 as constants that are being used in a program. Therefore, each token in a bisonc++ grammar has both a token type and a semantic value. See section 4.6 for details.

A token is a terminal symbol defined in the grammar, such as INTEGER, IDENTIFIER or ','. Tokens are used by the parser to make decisions about the continuation of the parsing process, but for that only the mere tokens are required, and nothing else.

Semantic values represent any additionally available information about tokens, such as the values of integers, or the names of identifiers. (A token such as ',' which is just punctuation normally doesn't have a particular semantic value.)

For example, an token might be an INTEGER, having the semantic value 4. Another INTEGER token could have semantic value 3989. When a grammar rule indicates that INTEGER is allowed, either of these token/value combinations is acceptable because each token is INTEGER. When the parser accepts the token, it may store the token's semantic value for future reference on a semantic value stack.

Nonterminal symbols often also have semantic values. For example, in a calculator, an expression typically has a semantic value that is a number. In a compiler for a programming language, an expression typically has a semantic value that is a tree structure describing the meaning of the expression.

3.4: Semantic Actions

A program usually must do more than just parse its input; it must also produce some output based on the input. In a bisonc++ grammar, a grammar rule can have an action consisting of C++ statements. Each time the parser recognizes a match for that rule, its associated action is executed. See section 4.6.2.

The purpose of an action usually is to compute the semantic value of the whole construct from the semantic values of its parts. For example, suppose we have a rule stating that an expression can be the sum of two expressions. Once the parser has recognized the rule's parts, each of its subexpressions has a semantic value of its own. The action for this rule could consist of a C++ expression computing the rule's semantic value from its subexpressions.

Actions usually compute semantic values and refer to semantic values of elements of production rules. This happens so frequently that several shorthand notations can be used to simplify referring to semantic values. These shorthand notations are covered extensively in section 4.6.2 and its sub-sections.

For example, here is a rule stating that an expression can be the sum of two subexpressions:


    expr: expr '+' expr   
        { 
            $$ = $1 + $3; 
        }
    ;
        
The action defines how the semantic value of the rule (using shorthand $$) is computed from the values of the two sub-expressions (shorthands $1 and $3).

3.5: Bisonc++ output: the Parser class

When bisonc++ is run, it processes a grammar file. If the grammar is error-free bisonc++ produces a C++ class, in which several members have already been defined. Therefore, bisonc++'s output consists of header files and a C++ source file, defining a member (parse) that parses the language described by the grammar. The class and its implementation is called a bisonc++ parser class. Keep in mind that the bisonc++ utility and the produced parser classes are distinct pieces of software: the bisonc++ utility is a program whose output is the bisonc++ parser class that is used by your program.

More specifically, bisonc++ generates the following files from a bisonc++ grammar file:

The task of bisonc++'s parse member is to group tokens according to rules defined in the grammar -- for example, to combine identifiers and operators into expressions. As it does this, it executes the actions of the grammar rules it has recognized.

The tokens processed by the parser produced by bisonc++ are made available by an object called the lexical analyzer or lexical scanner. The scanner is not produced by bisonc++, but must somehow be supplied (e.g., by writing it yourself). The parse member requests the next token from the lexical analyzer whenever it needs another token. The parser itself doesn't know the semantic values of the received tokens. Typically the lexical scanner produces tokens by parsing the characters of the program's input text. This, however, is not something that bisonc++ concerns itself with. See also section 5.3.1.

Bisonc++'s parsing function is a member function named parse. This parsing function nor the parser object for which it is called defines a complete C++ program: you must supply some additional details. One `detail' to be supplied is is the lexical analyzer. The parser class itself declares several additional members which can easily be redefined to fit your needs. One of these additional members is the error-reporting function called by parse to report errors. By default bisonc++ provides simple, but sensible, implementations for such members.

Having constructed a parser class and a lexical scanner class, objects of those classes must be defined in a complete C++ program. Usually such objects are defined in main; you have to provide this function, and arrange for it to call the parser's parse function, lest the parser never run. See chapter 5.

Note that, different from conventions used by Bison and Bison++, bisonc++ does not impose a particular name convention. In particular, there is no need to begin all variable and function names used in the bisonc++ parser with `yy' or `YY'. However, some name restrictions on symbolic tokens exist. See section 4.5.27.1 for details.

3.5.1: Bisonc++: an optionally reentrant Parser

A computer program or routine is called `reentrant' if it can safely recursively and concurrently be called from multiple processes. To be reentrant, a function must modify no static data, must not return a pointer to static data, must work only on the data provided to it by the caller, and must not call non-reentrant functions (Source: http://en.wikipedia.org/wiki/Reentrant).

Currently, bisonc++'s parsing member function parse may or may not be reentrant, depending on whether or not the option --thread-safe is specified.

The source file generated by bisonc++ containing the parsing member function not only contains this function, but also various tables (e.g., state transition tables) defined in the anonymous namespace. When the option --thread-safe is provided, these tables are const tables: their elements are not changed by the parsing function and so the parsing function, as it only manipulates its own local data, becomes reentrant.

3.6: Using Bisonc++: main steps

The design process when using bisonc++, from grammar specification to a working compiler or interpreter, consists of the following steps: Once the program's source files are available, they must be compiled and linked togrether, producing the final program.

3.7: The Overall Layout of a Bisonc++ Grammar File

Bisonc++ processes a file containing a grammar definition. Different from Bison++ and Bison grammar files, bisonc++'s grammar file consists of only two sections. The general form of a bisonc++ grammar file is as follows:

    Bisonc++ directives
    %%
    Grammar rules
        
Readers familiar with Bison may note that there is no C declaration section and no section to define Additional C code. With bisonc++ these sections are superfluous since, due to the fact that a bisonc++ parser is a class, all additionally required code can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be provided in the classes' header files, so also the `additional C code' section was omitted from the grammar file.

The `%%' is a punctuation that appears in every bisonc++ grammar file to separate the two sections.

The bisonc++ directives section contains all necessary directive specifications, like directives declaring the names of the terminal and nonterminal symbols, and directives describing operator precedence, and the data types of semantic values of various symbols.

The file's second section (grammar rules) is used to define how to construct each nonterminal symbol from its parts.

One special directive is availble that may be used in both the directives section and in the grammar rules section. This directive is %include, allowing you to split long grammar specifications into smaller, more comprehensible and accessible chunks. The %include directive is discussed in more detail in section 4.5.8.