Chapter 8: Error Recovery

Usually it is not acceptable to have a program terminate on a parse error. For example, a compiler should recover sufficiently to parse the rest of the input file and check it for errors; a calculator should accept another expression. Such errors violate the grammar for which the parser was constructed and are called syntactic errors. Other types of errors are called semantical errors: here the intended meaning of the language is not observed. For example, a division by too small a numeric constant (e.g., 0) may be detected by the parser compile time. In general, what can be detected compile time should not left for the run-time to detect, and so the parser should flag an error when it detects a division by a very small numerical constant. Bisonc++'s parsers may detect both syntactic and semantical errors. Syntactical errors are detected automatically, while the parser performs its parsing-job, semantical errors must explicitly be defined when the grammar is constructed. The following sections cover the way Bisonc++'s parser may handle syntactic errors and semantical errors, respectively.

8.1: Syntactic Error Recovery

In a simple interactive command parser where each input is one line, it may be sufficient to allow parse() to return PARSE_ABORT on error and have the caller ignore the rest of the input line when that happens (and then call parse() again). But this is inadequate for a compiler, because it forgets all the syntactic context leading up to the error. A syntactic error deep within a function in the compiler input should not cause the compiler to treat the following line like the beginning of a source file.

It is possible to specify how to recover from a syntactic error by writing rules recognizing the special token error. This is a terminal symbol that is always defined (it must not be declared) and is reserved for error handling. The bisonc++ parser generates an error token whenever a syntactic error is detected; if a rule was provided recognizing this token in the current context, the parse can continue. For example:

        
    statements:  
        // empty
    | 
        statements '\n'
    | 
        statements expression '\n'
    | 
        statements error '\n'

The fourth rule in this example says that an error followed by a newline makes a valid addition to any statements.

What happens if a syntactic error occurs in the middle of an expression? The error recovery rule, interpreted strictly, applies to the precise sequence of a statements, an error and a newline. If an error occurs in the middle of an expression, there will probably be some additional tokens and subexpressions on the parser's stack after the last statements, and there will be tokens waiting to be read before the next newline. So the rule is not applicable in the ordinary way.

bisonc++, however, can force the situation to fit the rule, by discarding part of the semantic context and part of the input. When a (syntactic) error occurs the parsing algorithm tries to recover from the error in the following way: First it discards states from the stack until it encounters a state in which the error token is acceptable (meaning that the subexpressions already parsed are discarded, back to the last complete statements). At this point the error token is shifted. Then, if the available look-ahead token is not acceptable to be shifted next, the parser continues to read tokens and to discard them until it finds a token which is acceptable. I.e., a token which can follow an error token in the current state. In this example, bisonc++ reads and discards input until the next newline was read so that the fourth rule can apply.

The choice of error rules in the grammar is a choice of strategies for error recovery. A simple and useful strategy is simply to skip the rest of the current input line or current statement if an error is detected:


    statement: 
        error ';'  // on error, skip until ';' is read

Another useful recovery strategy is to recover to the matching close-delimiter of an opening-delimiter that has already been parsed. Otherwise the close-delimiter probably appears to be unmatched, generating another, spurious error message:

    
    primary:  
        '(' expression ')'
    | 
        '(' error ')'
    |
        ...
    ;

Error recovery strategies are necessarily guesses. When they guess wrong, one syntactic error often leads to another. In the above example, the error recovery rule guesses that an error is caused by bad input within one statement. Suppose that instead a spurious semicolon is inserted in the middle of a valid statement. After the error recovery rule recovers from the first error, another syntactic error will immediately be found, since the text following the spurious semicolon is also an invalid statement.

To prevent an outpouring of error messages, the parser may be configured in such a way that no error message are generated for another syntactic error that happens shortly after the first. E.g., only after three consecutive input tokens have been successfully shifted error messages may again be generated. This configuration is currently not available in bisonc++'s parsers.

Note that rules using the error token may have actions, just as any other rules can.

The token causing an error is re-analyzed immediately when an error occurs. If this is unacceptable, then the member function clearin() may be called to skip this token. The function can be called by any member function of the Parser class. For example, suppose that on a parse error, an error handling routine is called that advances the input stream to some point where parsing should once again commence. The next symbol returned by the lexical scanner is probably correct. The previous token ought to be discarded using clearin().

8.1.1: Error Recovery

Bisonc++ implements a simple error recovery mechanism. When the lookup() function cannot find an action for the current token in the current state it throws an UNEXPECTED_TOKEN_ exception.

This exception is caught by the parsing function, calling the errorRecovery() member function. By default, this member function terminates the parsing process. The non-default recovery procedure is available once an error token is used in a production rule. When the parsing process throws UNEXPECTED_TOKEN_ the recovery procedure is started (i.e., it is started whenever a syntactical error is encountered or ERROR() is called).

The recovery procedure consists of

looking for the first state on the state-stack having an error-production, followed by:
handling all state transitions that are possible without retrieving a terminal token.
then, in the state requiring a terminal token and starting with the initial unexpected token (3) all subsequent terminal tokens are ignored until a token is retrieved which is a continuation token in that state.

If the error recovery procedure fails (i.e., if no acceptable token is ever encountered) error recovery falls back to the default recovery mode (i.e., the parsing process is terminated).

Not all syntactic errors are always reported: the option --required-tokens can be used to specify the minimum number of tokens that must have been successfully processed before another syntactic error is reported (and counted).

The option --error-verbose may be specified to obtain the contents of the state stack when a syntactic error is reported.

The example grammar may be provided with an error production rule:


    %token NR
    
    %left '+'
    
    %%
    
    start:
        start expr
    |
        // empty
    ;
    
    expr:
        error
    |
        NR
    |
        expr '+' expr
    ;

The resulting grammar has one additional state (handling the error production) and one state in which the ERR_ITEM flag has been set. When and error is encountered, this state obtains tokens until a token having a valid continuation was received, after which normal processing continues.

In the parser's verbose output (using the -V option) the various grammar rules and state transition tables are shown. The debug output below refers to this information.

The rules are:


    1: start ->  start expr
    2: start ->  <empty>
    3: expr (errTok_) ->  errTok_
    4: expr (NR) ->  NR
    5: expr ('+') ->  expr '+' expr
    6: start_$ ->  start

The state-transitions are:


State 0:
6: start_$ ->  . start 
  0:   On start to state 1
  Reduce by 2: start ->  . 


State 1:
6: start_$ -> start  . 
1: start -> start  . expr 
  0:   On expr to state 2
  1:   On errTok_ to state 3
  2:   On NR to state 4


State 2:
1: start -> start expr  . 
5: expr -> expr  . '+' expr 
  0:   On '+' to state 5
  Reduce by 1: start -> start expr  . 


State 3:
3: expr -> errTok_  . 
  Reduce by 3: expr -> errTok_  . 


State 4:
4: expr -> NR  . 
  Reduce by 4: expr -> NR  . 


State 5:
5: expr -> expr '+'  . expr 
  0:   On expr to state 6
  1:   On errTok_ to state 3
  2:   On NR to state 4


State 6:
5: expr -> expr '+' expr  . 
5: expr -> expr  . '+' expr 
  0 (removed by precedence):   On '+' to state 5
  Reduce by 5: expr -> expr '+' expr  .

The following output from the parse() function, generated by bisonc++ using the --debug option illustrates error recovery for the above grammar, entering the input


    a
    3+a

results in:


    parse(): Parsing starts
    
PUSH 0 (initializing the state stack)
    

LOOKUP: [0, 'Reserved_::UNDETERMINED_'] -> default reduce using rule 2
    
REDUCE: rule 2
    execute action 2 ...
    ... completed
    rule 2: pop 0 elements. Next will be: [0, 'start']
    pop 0 elements from the stack having size 1
    next: [0, 'start']
    

PUSH:   [0, 'start'] -> 1
    
scanner token `a' (97)
    
ERROR:  [1, `a' (97)] -> ??. Errors: 1
Syntax error
    Reached ERROR state 1
    
PUSH:   [1, 'errTok_'] -> 3
    

LOOKUP: [3, `a' (97)] -> default reduce using rule 3
    
REDUCE: rule 3
    rule 3: pop 1 elements. Next will be: [1, 'expr']
    pop 1 elements from the stack having size 3
    next: [1, 'expr']
    
available token 'expr'
    
PUSH:   [1, 'expr'] -> 2
    
available token `a' (97)
    
LOOKUP: [2, `a' (97)] -> default reduce using rule 1
    
REDUCE: rule 1
    rule 1: pop 2 elements. Next will be: [0, 'start']
    pop 2 elements from the stack having size 3
    next: [0, 'start']
    

PUSH:   [0, 'start'] -> 1
    
available token `a' (97)
    
scanner token 'NR'
    
PUSH:   [1, 'NR'] -> 4
    ERROR RECOVERED: next state 4
    

LOOKUP: [4, 'Reserved_::UNDETERMINED_'] -> default reduce using rule 4
    
REDUCE: rule 4
    execute action 4 ...
    ... completed
    rule 4: pop 1 elements. Next will be: [1, 'expr']
    pop 1 elements from the stack having size 3
    next: [1, 'expr']
    
available token 'expr'
    
PUSH:   [1, 'expr'] -> 2
    
scanner token 'Reserved_::EOF_'
    
LOOKUP: [2, 'Reserved_::EOF_'] -> default reduce using rule 1
    
REDUCE: rule 1
    execute action 1 ...
    ... completed
    rule 1: pop 2 elements. Next will be: [0, 'start']
    pop 2 elements from the stack having size 3
    next: [0, 'start']
    

PUSH:   [0, 'start'] -> 1
    
available token 'Reserved_::EOF_'
    ACCEPT(): Parsing successful
    parse(): returns 0 or 1

The final debug message should be interpreted as a C++ expression: the 0 indicates that Parse::ACCEPT rather than Parser::ABORT was called, the 1 shows the value of d_nErrors_. Consequently, parse() returns 1 (i.e., `0 or 1').

8.2: Semantic Error Recovery

Semantic error recovery once again requires judgment on the part of the grammar-writer. For example, an assignment expression may be syntactically defined as


    expr '=' expr

The assignment operator's left-hand side must be a so-called lvalue. An lvalue is simply an addressable location, like a variable's identifier, a dereferenced pointer expression or some other address-expression. The right-hand side is a so-called rvalue: this may be any value: any expression will do.

A rule like the above leaves room for many different semantic errors:

Since the rule's LHS equals expr, any expression is accepted by the parser. E.g.,
```
    3 = 12
        
```
So, the action associated with this rule should check whether the expression's left-hand side is actually an lvalue. If not, a semantic error should be reported;
In a typed language (like C++), not all assignments are possible. E.g., it is not acceptable to assign a std:string value to a double variable. When conflicting types are used, a semantic error should be reported;
In a language requiring variables to be defined or declared before they are used (like C++) the parser should check whether a variable is actually defined or declared when it is used in an expression. If not, a semantic error should be reported

A parser that should be able to detect semantic errors normally uses a counter counting the number of semantic errors, e.g.,

size_t
d_nSemanticErrors

. It may be possible to test this counter's value once the input has been parsed, calling ABORT() (see section 5.3) if the counter isn't zero anymore. When the grammar's start symbol itself has multiple alternatives, it is probably easiest to augment the grammar with an additional rule, becoming the augmented grammar's start symbol which simply calls the former start symbol. For example, if input was the name of the original start-symbol, augment the grammar as follows to ensure a PARSE_ABORT return value of the parse() member when either syntactic or semantic errors were detected:


    semantic_input:                 // new start-symbol
        input
        {
            if (d_nSemanticErrors)  // return PARSE_ABORT
                ABORT();            // on semantic errors too.
        }

Returning from the parser's parse() member the number of syntactic and semantic errors could then be printed, whereupon the program might terminate.