flexc++input(7)

flexc++ input file organization
(flexc++.2.07.00.tar.gz)

2008-2018

NAME

flexc++input - Organization of flexc++'s input files

DESCRIPTION

Flexc++(1) was designed after flex(1) and flex++(1). Like these two programs flexc++ generates code performing pattern-matching on text, possibly executing actions when certain regular expressions are recognized.

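Refer to flexc++(1) for a general overview. This manual page describes how flexc++'s input files should be organized. Following a note about reserved identifiers (UNDERSCORES) it contains the following sections: 1. SPECIFICATION FILE(S), 2. FILE SWITCHING, 3. DIRECTIVES, 4. MINI SCANNERS, 5. DEFINITIONS, 6. %% SEPARATOR, 7. REGULAR EXPRESSIONS, and 8. SPECIFICATION EXAMPLE.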

UNDERSCORES

Starting with version 2.07.00 flexc++ reserved identifiers no longer end in two underscore characters, but in one. This modification was necessary because according to the C++ standard identifiers having two or more consecutive underscore characters are reserved by the language. In practice this could require some minor modifications of existing source files using flexc++'s facilities, most likely limited to changing StartCondition__ into StartCondition_ and changing PostEnum__ into PostEnum_.

The complete list of affected names is:

Enums:
ActionType_, Leave_, StartCondition_, PostEnum_;
Member functions:
actionType_, continue_, echoCh_, echoFirst_, executeAction_, getRange_, get_, istreamName_, lex_, lop1_, lop2_, lop3_, lop4_, lopf_, matched_, noReturn_, print_, pushFront_, reset_, return_;
Protected data members:
d_in_, d_token_, s_finIdx_, s_interactive_, s_maxSizeofStreamStack_, s_nRules_, s_rangeOfEOF_, s_ranges_, s_rf_.

1. SPECIFICATION FILE(S)

Flexc++ expects an input file containing directives and the regular expressions that should be recognized by objects of the scanner class generated by flexc++. In this man page the elements and organization of flexc++'s input file are described.

Flexc++'s input file consists of two sections, separated from each other by a line merely containing two consecutive percent characters:


%%
    
The section before this separator contains directives; the section following this separator contains regular expressions and possibly actions to perform when these regular expressions are matched by the object of the scanner class generated by flexc++. If a second line beginning with two consecutive percent characters is encountered, flexc++ stops processing its input file at that line. See also section 6 (%% SEPARATOR) below.

White space is usually ignored, as are comments. Comments may be of the traditional C form (i.e., /*, followed by (possibly multi-line) comment text, followed by */), or they may be C++ end-of-line comments: two consecutive slashes (//) start the comment, which continues up to the next newline character.

2. FILE SWITCHING

Flexc++'s input file may be split into multiple files. This allows for the definition of logically separate elements of the specifications in different files. Include directives must be specified on a line of their own. To switch to another specification file the following stanza is used:


//include file-location
        
The //include directive starts in the line's first column. File locations can be absolute or relative to the location of the file containing the //include directive. White space characters following //include and before the end of the line are ignored. The file specification may be surrounded by double quotes, but these double quotes are not required and are ignored (removed) if present. All remaining characters are expected to define the name of the file where flexc++'s rules specifications continue. Once end of file of a sub-file has been reached, processing continues at the line beyond the //include directive of the previously scanned file. The end-of-file of the file that was initially specified when flexc++ was called indicates the end of flexc++'s rules specification.
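The handling of an //include line described above can be sketched in C++. This is a minimal illustration under stated assumptions, not flexc++'s own implementation: the function name includeTarget is hypothetical, and the sketch assumes that at least one blank separates //include from the file location.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of flexc++): if `line` holds an //include
// directive, stores the file location in `target` and returns true.
bool includeTarget(std::string const &line, std::string &target)
{
    std::string const keyword = "//include";
    std::string const blanks = " \t";

    // The directive must start in the line's first column.
    if (line.compare(0, keyword.size(), keyword) != 0)
        return false;

    // Skip the white space following the keyword. Requiring at least one
    // blank here is an assumption of this sketch.
    std::string::size_type const first =
        line.find_first_not_of(blanks, keyword.size());
    if (first == std::string::npos || first == keyword.size())
        return false;

    // White space before the end of the line is ignored as well.
    std::string::size_type const last = line.find_last_not_of(blanks);
    assert(last >= first);
    target = line.substr(first, last - first + 1);

    // Surrounding double quotes are optional and are removed if present.
    if (target.size() >= 2 && target.front() == '"' && target.back() == '"')
        target = target.substr(1, target.size() - 2);

    return true;
}
```

Calling includeTarget on a line that does not start with //include in its first column returns false and leaves the target string unchanged.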

3. DIRECTIVES

The first section of flexc++'s input file consists of directives. In addition it may associate regular expressions with symbolic names, allowing you to use these identifiers in the rules section. Each directive is defined on a line of its own. When available, directives are overridden by flexc++ command line options.

Some directives require arguments, which are usually provided following separating (but optional) = characters. Arguments of directives are text, surrounded by double quotes (strings), or embedded in raw string literals (rawstrings). Double quotes or backslashes inside strings must themselves be preceded by backslashes; these backslashes are not required when rawstrings are used.

The %s and %x directives are immediately followed by name lists, consisting of identifiers separated by blanks. Here is an example of the definition of a directive:


    %class-name = "MyScanner"
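The difference between strings and rawstrings described above follows the same convention as C++'s own string literals, so it can be illustrated in plain C++. The function names and the sample text are hypothetical; the sketch merely shows that both notations denote the same characters.

```cpp
#include <cassert>
#include <string>

// Ordinary string literal: the backslash and the embedded double quotes
// are each preceded by a backslash.
std::string quotedForm()
{
    return "dir\\my \"scanner\"";
}

// Raw string literal: the same characters, written without extra backslashes.
std::string rawForm()
{
    return R"(dir\my "scanner")";
}

// Both notations denote identical text.
void checkForms()
{
    assert(quotedForm() == rawForm());
}
```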
        

Directives accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); directives accepting a `pathname' may contain directory separators. A `pathname' containing blank characters should be surrounded by double quotes.

Some directives may generate errors. This happens when a directive conflicts with the contents of an existing file which flexc++ cannot modify (e.g., a scanner class header file exists, but doesn't define a name space, but a %namespace directive was provided). To solve the error the offending directive could be omitted, the existing file could be removed, or the existing file could be hand-edited according to the directive's specification. Note that flexc++ currently does not handle the opposite error condition: if a previously used directive is omitted, then flexc++ does not detect the inconsistency. In those cases you may encounter compilation errors.

4. MINI SCANNERS

Mini scanners come in two flavors: inclusive mini scanners and exclusive mini scanners. The rules that apply to an inclusive mini scanner are the mini scanner's own rules as well as the rules which apply to no mini scanners in particular (i.e., the rules that apply to the default (or INITIAL) mini scanner). Exclusive mini scanners only use the rules that were defined for them.

To define an inclusive mini scanner use %s, followed by one or more identifiers specifying the name(s) of the mini scanner(s). To define an exclusive mini scanner use %x, followed by one or more identifiers specifying the name(s) of the mini scanner(s). The following example defines the names of two mini scanners: string and comment:


    %x string comment 
        
Following this, rules defined in the context of the string mini scanner (see below) will only be used when that mini scanner is active.

A flexc++ input file may contain multiple %s and %x specifications.
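The difference between inclusive and exclusive mini scanners can be modeled in a few lines of C++. This is only a sketch of the selection rule described in this section: the function isActive and its parameters are hypothetical, the INITIAL mini scanner is treated as inclusive, and the <*> prefix is not modeled.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model, not flexc++'s implementation.
//   ruleConditions: names listed in the rule's <...> prefix (empty if the
//                   rule has no such prefix)
//   current:        name of the active mini scanner
//   inclusive:      true if `current` was declared with %s (or is INITIAL),
//                   false if it was declared with %x
bool isActive(std::set<std::string> const &ruleConditions,
              std::string const &current, bool inclusive)
{
    // A rule explicitly defined for the active mini scanner always applies.
    if (ruleConditions.count(current) != 0)
        return true;

    // Rules not tied to any mini scanner in particular also apply to
    // inclusive mini scanners, but never to exclusive ones.
    return ruleConditions.empty() && inclusive;
}
```

With string declared by %x, a rule without a <...> prefix is inactive while string is active; had string been declared by %s, the same rule would be active.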

5. DEFINITIONS

Definitions are of the form


identifier  regular-expression
        
Each definition must be entered on a line of its own. Definitions associate identifiers with regular expressions, allowing the use of {identifier} as synonym for its regular expression in the rules section of flexc++'s input file. Once defined, the identifiers representing regular expressions can also be used in subsequent definitions.

Example:


FIRST                   [A-Za-z_]
NAME                    {FIRST}[-A-Za-z0-9_]*
        

6. %% SEPARATOR

Following directives and definitions a line merely containing two consecutive % characters is expected. Following this line the rules are defined. Rules consist of regular expressions which should be recognized, possibly followed by actions to be executed once a rule's regular expression has been matched.

If the rule section contains a line starting with two consecutive % characters, then any remaining input is ignored. Note that this second %% separator does not have to be specified. It is purely optional. To specify a regular expression starting with %% surround the %% with double quotes ("%%") or prefix the %% with a blank space: the %%-characters are only considered a separator if they are encountered at the very beginning of a line.

7. REGULAR EXPRESSIONS

The regular expressions defined in flexc++'s rules files are matched against the information passed to the scanner's lex function.

Regular expressions begin at the first non-blank character on a line. Comment is only recognized as comment if it isn't part of the regular expression. To define a regular expression starting with two slashes (at least) the first slash can be escaped or double quoted (e.g., "//".* defines C++ comment to end-of-line).

Regular expressions end at the first blank character (to add a blank character, e.g., a space character, to a regular expression, prefix it by a backslash or put it in a double-quoted string).

Actions may be associated with regular expressions. At a match the action that is associated with the regular expression is executed, after which scanning continues when the lexical scanning function (e.g., lex) is called again. Actions are not required, and regular expressions can be defined without any actions at all. If such action-less regular expressions are matched then the match is performed silently, after which processing continues.

Flexc++ tries to match as many characters of the input file as possible (i.e., it uses `greedy matching'). Non-greedy matching is accomplished by a combination of a scanner and parser and/or by using the `lookahead' operator (/).
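Greedy matching across several rules can be sketched with std::regex standing in for the generated scanner. This is an illustration only: flexc++ does not use std::regex, the names Match and longestMatch are hypothetical, and letting the earlier pattern win when two matches are equally long is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Match
{
    std::size_t rule;       // index of the winning pattern
    std::size_t length;     // number of matched characters (0: no match)
};

// Tries every pattern at the very start of `input` and keeps the longest
// match. On equal lengths the earlier pattern is kept (an assumption).
Match longestMatch(std::vector<std::string> const &patterns,
                   std::string const &input)
{
    Match best{patterns.size(), 0};

    for (std::size_t idx = 0; idx != patterns.size(); ++idx)
    {
        std::regex const re(patterns[idx]);
        std::smatch found;

        // match_continuous: the match must begin at the first character.
        if (std::regex_search(input, found, re,
                              std::regex_constants::match_continuous))
        {
            std::size_t const len = static_cast<std::size_t>(found.length(0));
            if (len > best.length)
                best = Match{idx, len};
        }
    }
    return best;
}
```

Given the patterns if and [a-z]+ and the input iffy, the second pattern wins because it matches four characters where the first matches only two.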

The following regular expression `building blocks' are available. More complex regular expressions are created by combining them:

x
the character `x';
.
any character (byte) except newline;
[xyz]
a character class; in this case, the pattern matches either an `x', a `y', or a `z'. See also the paragraph about character classes below;
[abj-oZ]
a character class containing a range; matches an `a', a `b', any letter from `j' through `o', or a `Z'. See also the paragraph about character classes below;
[^A-Z]
a negated character class, i.e., any character except for those in the class. In this example, any character other than a capital letter. See also the paragraph about character classes below;
"[xyz]\"foo"
text between double quotes matches the literal string: [xyz]"foo;
R"([xyz]\"foo)"
the literal string `[xyz]\"foo' (using a raw string literal);
\X
if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C interpretation of `\X' is matched. Otherwise, a literal `X' is matched (this is used to escape operators such as `*');
\0
a NUL character (ASCII code 0);
\123
the character with octal value 123;
\x2a
the character with hexadecimal value 2a;
(r)
the regular expression `r'; parentheses are used to override precedence (see below);
{name}
the expansion of the `name' definition;
r*
zero or more regular expressions `r'. This also matches the empty string;
r+
one or more regular expressions `r';
r?
zero or one regular expression `r'. This also matches the empty string;
rs
the regular expression `r' followed by the regular expression `s'; called concatenation;
r{m, n}
regular expression `r' at least m, but at most n times (0 <= m <= n). A regular expression to which {0, 0} is appended is ignored, and a warning message is shown;
r{m,}
regular expression `r' m or more times (0 <= m);
r{m}
regular expression `r' exactly m times (0 <= m). A regular expression to which {0} is appended is ignored, and a warning message is shown;
r|s
either regular expression `r' or regular expression `s';
r/s
regular expression `r' if it is followed by regular expression `s'. The text matched by `s' is included when determining whether this rule results in the longest match, but `s' is then returned to the input before the rule's action (if defined) is executed.

If flexc++ detects patterns potentially not matching any text it generates warnings like this:


    [Warning] input, line 7: null-matching regular expression
        
By placing the comment

    //%nowarn
        
on the line just before a regular expression that potentially does not match any text, the warning for that regular expression is suppressed.

^r
a regular expression `r' at the beginning of a line or file;
r$
a regular expression `r', occurring at the end of a line. This pattern is identical to `r/\n';
<s>r
a regular expression `r' in start condition `s';
<s1,s2,s3>r
a regular expression `r' in start conditions s1, s2, or s3;
<*>r
a regular expression `r' in all start conditions;
<<EOF>>
an end-of-file;
<s1,s2><<EOF>>
an end-of-file when in start conditions s1 or s2.

Character classes

Inside a character class all regular expression operators lose their special meanings, except for the escape character (\), the character range operator -, the end of character class operator ], and, at the beginning of the class, ^. All ordinary escape sequences are supported, all other escaped characters are interpreted as literal characters (e.g., \c is a literal c).

To add a closing bracket to a character class use [] or \]. To add a closing bracket to a negated character class use [^] (or use [^ followed by \] somewhere within the character class). Minus characters are used to define character ranges (e.g., [a-d], defining [abcd]) except in the following cases, where flexc++ recognizes a literal minus character: [- or [^- (a minus at the very beginning of a character class); -] (a minus at the very end of a character class); or \- (an escaped minus character). Once a character class has started, all subsequent characters (or character ranges) are added to the set, until the final closing bracket (]) has been reached.

Operator precedence

The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. From lowest to highest precedence, the operators are:

The lex standard defines concatenation as having a higher precedence than the interval expression. This is different from many other regular expression engines, and flexc++ follows these latter engines, giving all `multiplication operators' equal priority.

Name expansion has the same precedence as grouping (using parentheses to influence the precedence of the other operators in the regular expression). Since the name expansion is treated as a group in flexc++, it is not allowed to use the lookahead operator in a name definition (a named pattern, defined in the definition section).

Predefined sets of characters

Character classes can also contain character class expressions. These are expressions enclosed inside [: and :] delimiters (which themselves must appear between the [ and ] of the character class. Other elements may occur inside the character class as well). The character class expressions are:

     
     [:alnum:] [:alpha:] [:blank:]
     [:cntrl:] [:digit:] [:graph:]
     [:lower:] [:print:] [:punct:]
     [:space:] [:upper:] [:xdigit:]
        

Character class expressions designate a set of characters equivalent to the corresponding standard C isXXX function. For example, [:alnum:] designates those characters for which isalnum returns true, i.e., any alphabetic or numeric character. Accordingly, in the C locale the following character classes are all equivalent:

 
    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:][0-9]]
    [a-zA-Z0-9]
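
The equivalence of these classes can be checked against the C classification functions. The following sketch assumes the default C locale and an ASCII-compatible character set, and only inspects 7-bit character values; all function names are hypothetical.

```cpp
#include <cassert>
#include <cctype>

// [[:alpha:][:digit:]] expressed through the C classification functions
bool alphaOrDigit(int ch)
{
    return std::isalpha(ch) != 0 || std::isdigit(ch) != 0;
}

// [a-zA-Z0-9] written out as explicit ranges (assumes ASCII ordering)
bool inRanges(int ch)
{
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
           (ch >= '0' && ch <= '9');
}

// Returns true if [[:alnum:]] agrees with both other formulations for
// every 7-bit character value.
bool classesAgree()
{
    for (int ch = 0; ch != 128; ++ch)
    {
        bool const alnum = std::isalnum(ch) != 0;       // [[:alnum:]]
        if (alnum != alphaOrDigit(ch) || alnum != inRanges(ch))
            return false;
    }
    return true;
}
```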
        

A negated character class such as the example [^A-Z] above will match a newline unless \n (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., [^A-Z\n]). This differs from the way many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like [^"]* can match the entire input unless there's another quote in the input.
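
The effect of newlines on negated character classes can be observed with std::regex, whose default grammar happens to treat negated classes the same way. This is a stand-in for illustration, not flexc++'s matcher, and the function name matchedLength is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Length of the text matched by `pattern` at the very start of `input`,
// or 0 if the pattern does not match there.
std::size_t matchedLength(std::string const &pattern,
                          std::string const &input)
{
    std::regex const re(pattern);
    std::smatch found;

    if (!std::regex_search(input, found, re,
                           std::regex_constants::match_continuous))
        return 0;

    return static_cast<std::size_t>(found.length(0));
}

// [^A-Z] accepts a newline, since a newline is not a capital letter.
std::size_t plainNegation()
{
    return matchedLength("[^A-Z]", "\nX");
}

// Here the C++ compiler turns \n into a newline character, which thus
// becomes an explicit member of the negated class.
std::size_t newlineExcluded()
{
    return matchedLength("[^A-Z\n]", "\nX");
}

// [^"]* runs across the line break, up to the next double quote.
std::size_t upToQuote()
{
    return matchedLength("[^\"]*", "abc\ndef\" tail");
}
```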

Flexc++ allows negation of character class expressions by prepending ^ to the POSIX character class name.

                
    [:^alnum:] [:^alpha:] [:^blank:]
    [:^cntrl:] [:^digit:] [:^graph:]
    [:^lower:] [:^print:] [:^punct:]
    [:^space:] [:^upper:] [:^xdigit:]
        

Combining character sets

The {-} operator computes the difference of two character classes. For example, [a-c]{-}[b-z] represents all the characters in the class [a-c] that are not in the class [b-z] (which in this case, is just the single character a). The {-} operator is left associative, so [abc]{-}[b]{-}[c] is the same as [a].

The {+} operator computes the union of two character classes. For example, [a-z]{+}[0-9] is the same as [a-z0-9]. This operator is useful when preceded by the result of a difference operation, as in, [[:alpha:]]{-}[[:lower:]]{+}[q], which is equivalent to [A-Zq] in the C locale.
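
The {-} and {+} operators behave like set difference and set union. A minimal C++ model using std::set shows the examples from this section; the names CharSet, range, difference, unite and text are hypothetical, and the sketch assumes an ASCII-compatible character set.

```cpp
#include <cassert>
#include <set>
#include <string>

using CharSet = std::set<char>;

// All characters from `first` through `last`, e.g. range('a', 'c') for [a-c]
CharSet range(char first, char last)
{
    assert(first <= last);
    CharSet result;
    for (int ch = first; ch <= last; ++ch)
        result.insert(static_cast<char>(ch));
    return result;
}

// lhs{-}rhs: the characters of lhs that are not in rhs
CharSet difference(CharSet lhs, CharSet const &rhs)
{
    for (char ch : rhs)
        lhs.erase(ch);
    return lhs;
}

// lhs{+}rhs: the characters that are in lhs, in rhs, or in both
CharSet unite(CharSet lhs, CharSet const &rhs)
{
    lhs.insert(rhs.begin(), rhs.end());
    return lhs;
}

// The set's characters in sorted order, for easy inspection
std::string text(CharSet const &set)
{
    return std::string(set.begin(), set.end());
}
```

Because the operators associate to the left, [abc]{-}[b]{-}[c] corresponds to difference(difference(range('a', 'c'), range('b', 'b')), range('c', 'c')).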

Trailing context

A rule can have at most one instance of trailing context (the / operator or the $ operator). The start condition, ^, and <<EOF>> patterns can only occur at the beginning of a pattern, and cannot be surrounded by parentheses. The characters ^ and $ only have their special properties at, respectively, the beginning and end of regular expressions. In all other cases they are treated as normal characters.
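
The effect of trailing context (r/s) resembles a lookahead assertion: s must follow r, but does not become part of the matched text. The sketch below approximates this with std::regex. It is not how flexc++ implements the operator, the function name is hypothetical, and it does not model how the length of s takes part in selecting the longest match.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns the length of the text matched by `r` at the start of `input`,
// provided `s` matches immediately after it; returns 0 otherwise.
std::size_t matchWithTrailingContext(std::string const &r,
                                     std::string const &s,
                                     std::string const &input)
{
    assert(!r.empty() && !s.empty());

    // (?=...) checks that `s` follows without consuming it, mirroring the
    // way the text matched by `s` is returned to the input.
    std::regex const re("(?:" + r + ")(?=" + s + ")");
    std::smatch found;

    if (!std::regex_search(input, found, re,
                           std::regex_constants::match_continuous))
        return 0;

    return static_cast<std::size_t>(found.length(0));
}
```

For instance, with r being [0-9]+ and s being px, the input 12px yields a match of two characters, whereas 12em yields no match at all.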

8. SPECIFICATION EXAMPLE


%option debug

%x comment

NAME    [[:alpha:]][_[:alnum:]]*

%%

"//".*          // ignore

"/*"            begin(StartCondition_::comment);

<comment>.|\n   // ignore
<comment>"*/"   begin(StartCondition_::INITIAL);

^a              return 1;
a               return 2;
a$              return 3;
{NAME}          return 4;

.|\n            // ignore
        


FILES

Flexc++'s default skeleton files are in /usr/share/flexc++.
By default, flexc++ generates the following files:

SEE ALSO

flexc++(1), flexc++api(3)

BUGS

COPYRIGHT

This is free software, distributed under the terms of the GNU General Public License (GPL).

AUTHOR

Frank B. Brokken (f.b.brokken@rug.nl),
Jean-Paul van Oosten (j.p.van.oosten@rug.nl),
Richard Berendsen (richardberendsen@xs4all.nl) (until 2010).