Cetus is written in Java, so it is natural to use ANTLR to generate parsers whenever possible. Cetus comes with an ANTLR parser for C. We determined that ANTLR cannot be used for C++. We are aware that there is a C++ grammar on the ANTLR website, but it is incomplete and we wanted a grammar that matched the standard grammar in Stroustrup's book as much as possible.
Parsing was intentionally separated from the IR-building methods in the high-level interface so that other front ends could be added independently. Some front ends may require more effort than others. For example, writing a parser for C++ is a challenge because its grammar does not fit easily into any of the grammar classes supported by standard generators. The GNU C++ compiler was able to use an LALR(1) grammar, but it looks nothing like the ISO C++ grammar. If any rules must be rearranged to add actions in a particular location, the rearrangement must be done with extreme care to avoid breaking the grammar. Another problem is that C++ has much more complicated rules than C for determining which symbols are identifiers and which are type names, so the parser must maintain substantial symbol table information while parsing.
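To illustrate the kind of symbol table maintenance mentioned above, the following is a minimal sketch, not Cetus code, of a scoped table that a parser could consult to decide whether a name should be treated as a type name or an ordinary identifier. The class `TypeNameTable` and its methods are hypothetical, and the sketch deliberately ignores C++ complications such as namespaces, templates, and a variable declaration hiding a type of the same name.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical scoped table of type names (illustration only). */
public class TypeNameTable {
    // Innermost scope is at the head of the deque.
    private final Deque<Set<String>> scopes = new ArrayDeque<>();

    public TypeNameTable() {
        enterScope(); // global scope
    }

    public void enterScope() {
        scopes.push(new HashSet<>());
    }

    public void exitScope() {
        scopes.pop();
    }

    /** Records that a typedef, class, or similar declaration introduced a type name. */
    public void declareType(String name) {
        scopes.peek().add(name);
    }

    /** Returns true if the name is a type in any enclosing scope. */
    public boolean isTypeName(String name) {
        for (Set<String> scope : scopes) {
            if (scope.contains(name)) {
                return true;
            }
        }
        return false;
    }

    /** Token class a scanner would hand to the parser for this name. */
    public String classify(String name) {
        return isTypeName(name) ? "TYPE_NAME" : "IDENTIFIER";
    }
}
```

With such a table, a statement like `a * b;` would be treated as a declaration of a pointer `b` when `a` has been declared as a type, and as a multiplication otherwise. Keeping this information correct for full C++ is what makes the parser actions large.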
The C language was the original focus of the Cetus project.
There is nothing unusual about the C scanner. It is made with flex, and directly follows the ANSI C standard. It accepts a few GCC-specific keywords, as well as the C99 restrict keyword.
C++ was the primary reason for allowing separate parsers, as discussed above.
Even the C++ scanner is difficult. The < token can mean either less-than or the start of a template specification. Without complete symbol table information, it is not possible to perfectly disambiguate between the two cases. However, building complete symbol table information in the C++ parser and feeding it back to the scanner defeats the purpose of having a separate parser, because actions become large and the grammar may need to be modified to provide places for new actions. Modifying the C++ grammar is no easy task.
There is a heuristic in the scanner for performing the disambiguation. When it encounters a <, it saves the state of the scanner, and then looks at subsequent tokens until it can decide which token the < is. The lookahead is bounded and will terminate with an error message if it exceeds the bounds. After the appropriate token has been chosen, the state of the scanner is restored and the token is returned to the parser. The heuristic works correctly for the C++ template library, but it could always be improved.
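The general shape of such a bounded lookahead can be sketched as follows. This is not the heuristic used in the Cetus scanner, whose rules are not described here; the class name, the bound of 64 tokens, and the specific decision rules (a matching > means template, while a statement boundary, an unbalanced parenthesis, or a logical operator means less-than) are all assumptions made for illustration. The token list is only read, which stands in for saving and restoring the scanner state.

```java
import java.util.List;

/** Hypothetical bounded-lookahead classifier for the "<" token (illustration only). */
public class AngleBracketHeuristic {
    /** Assumed bound on how many tokens may be inspected. */
    static final int MAX_LOOKAHEAD = 64;

    enum Kind { LESS_THAN, TEMPLATE_OPEN }

    /** Classifies the "<" found at tokens.get(pos) without consuming any input. */
    static Kind classify(List<String> tokens, int pos) {
        int angles = 0; // nesting depth of < ... >
        int parens = 0; // nesting depth of ( ... ) opened after the "<"
        for (int i = pos; i < tokens.size(); i++) {
            if (i - pos > MAX_LOOKAHEAD) {
                throw new IllegalStateException("lookahead bound exceeded while classifying '<'");
            }
            switch (tokens.get(i)) {
                case "<":
                    if (parens == 0) angles++;
                    break;
                case ">":
                    // A matching close bracket suggests a template specification.
                    if (parens == 0 && --angles == 0) return Kind.TEMPLATE_OPEN;
                    break;
                case "(":
                    parens++;
                    break;
                case ")":
                    // Closing a parenthesis opened before the "<": treat as a comparison.
                    if (parens == 0) return Kind.LESS_THAN;
                    parens--;
                    break;
                case ";":
                case "{":
                case "}":
                case "&&":
                case "||":
                    // Tokens assumed not to appear inside a template specification.
                    return Kind.LESS_THAN;
                default:
                    break;
            }
        }
        return Kind.LESS_THAN; // input ended before a matching ">"
    }
}
```

Under these assumed rules, the < in `vector < int > v ;` is classified as a template opener and the < in `if ( a < b ) x ;` as less-than. An expression such as `a < b > c ;` would be misclassified, which is the kind of case that cannot be resolved without symbol table information.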
We are extending Cetus for C++ by using a Generalized LR (GLR, also called stack-forking or Tomita parsing) parser generator. Such parsers can handle any context-free grammar, including ambiguous ones, and defer semantic analysis to a later pass. GLR support has recently been added to GNU Bison and provides a way to create a C++ parser that accepts the entire language without using a symbol table. An important benefit is that the grammar can be kept very close to the ISO grammar. We have developed a parser for the complete C++ language plus some GCC extensions using Bison. We believe it is due to the language's complexity that there are fewer research papers dealing with C++ than with other languages, despite C++'s wide use in industry. For the above reasons, Cetus should be able to provide an easy-to-use C++ infrastructure, making it a very important research tool.
The output of either the C parser or the C++ parser is a parse tree file. The parse tree file is compatible with the Graphviz package, so parse trees can be visualized with Graphviz, but only for small programs. Parse trees become unmanageably large for programs of more than a few hundred lines. Nevertheless, Graphviz was useful for verifying that the parser worked correctly by examining parse trees for small chunks of code.
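As an illustration of what a Graphviz-compatible parse tree can look like, the sketch below writes a small tree in the DOT language. It is a minimal example under stated assumptions: the `Node` class, the node labels, and the `n0`, `n1`, ... naming scheme are hypothetical and do not describe the actual file layout produced by the Cetus parsers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parse tree printer that emits Graphviz DOT text (illustration only). */
public class ParseTreeDot {
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) { this.label = label; }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Returns a DOT digraph with one vertex per tree node and one edge per parent-child link. */
    static String toDot(Node root) {
        StringBuilder out = new StringBuilder("digraph parse_tree {\n");
        emit(root, out, new int[] {0});
        return out.append("}\n").toString();
    }

    // Emits the subtree in pre-order and returns the numeric id assigned to its root.
    private static int emit(Node node, StringBuilder out, int[] nextId) {
        int id = nextId[0]++;
        out.append("  n").append(id)
           .append(" [label=\"").append(escape(node.label)).append("\"];\n");
        for (Node child : node.children) {
            int childId = emit(child, out, nextId);
            out.append("  n").append(id).append(" -> n").append(childId).append(";\n");
        }
        return id;
    }

    // Escapes backslashes and double quotes so labels stay valid DOT strings.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Node tree = new Node("expression_statement")
            .add(new Node("assignment")
                .add(new Node("a"))
                .add(new Node("="))
                .add(new Node("1")))
            .add(new Node(";"));
        System.out.print(toDot(tree));
    }
}
```

If the output is saved to a file such as `tree.dot`, it can be rendered with the Graphviz command `dot -Tpng tree.dot -o tree.png`. Since every parse tree node becomes a vertex, the drawing grows quickly with program size.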