Cetus' IR contrasts with the Polaris Fortran translator's IR in that it uses a hierarchical statement structure. The Cetus IR directly reflects the block structure of a program. Polaris lists the statements of each procedure in a flat way, with a reference to the outer statement being the only way for determining the block structure. There are also important differences in the representation of expressions, which further reflects the differences between C and Fortran. The Polaris IR includes assignment statements, whereas Cetus represents assignments in the form of expressions. This corresponds to the C language's feature to include assignment side effects in any expression.
The IR is structured such that the original source program can be reproduced, but this is where source-to-source translators face an intrinsic dilemma. Keeping the IR and output similar to the input will make it easy for the user to recognize the transformations applied by the compiler. On the other hand, keeping the IR language independent leads to a simpler compiler architecture, but may make it impossible to reproduce the original source code as output. In Cetus, the concept of statements and expressions are closely related to the syntax of the C language, facilitating easy source-to-source translation. However, the drawback is increased complexity for pass writers (since they must think in terms of C syntax) and limited extensibility of Cetus to additional languages. That problem is mitigated by the provision of several abstract classes, which represent generic control constructs. Generic passes can then be written using the abstract interface, while more language-specific passes can use the derived classes. We feel it is important to work with multiple languages at an early stage, so that our result is not simply a design that is extensible in theory but also in practice. Toward this goal, we have begun adding a C++ front end and generalizing the IR so that we can evaluate these design trade-offs. Other potential front ends are Java and Fortran 90.
Our design goal was a simple IR class hierarchy easily understood by users. It should also be easy to maintain, while being rich enough to enable future extension without major modification. The basic building blocks of a program are the translation units, which represent the content of a source file, and procedures, which represent individual functions. Procedures include a list of simple or compound statements, representing the program control flow in a hierarchical way. That is, compound statements, such as IF-constructs and FOR-loops include inner (simple or compound) statements, representing then and else blocks or loop bodies, respectively. Expressions represent the operations being done on variables, including the assignments to variables.
When designing any class hierarchy, some general principles are followed. The main principle is that a derived class satisifies an is a relationship with its parent, such that the data and methods of the parent make sense in the context of the child. This principle is followed to some extent in Cetus, but it would be more accurate to say that a derived class can appear in the grammar wherever its parent can appear. For example, AccessLevel and Annotation are derived from Declaration, however neither of them can be used to declare new symbols. They are children simply because they may appear in the same location in the grammar (for AccessLevel) or in the IR (for Annotation) as their parent.
There is a distinction between the class hierarchy and the syntax tree. For example, in the syntax tree, the parent of a TranslationUnit is a Program, however neither TranslationUnit nor Program have a parent in the class hierarchy.
One important aspect that makes an infrastructure useful is providing a good set of tools to help debug future compiler passes. Cetus provides basic debugging support through the Java language, which contains exceptions and assertions as built-in features. Cetus executes within a Java virtual machine, so a full stack trace including source line numbers is available whenever an exception is caught or the compiler terminates abnormally.
Such error reporting is useless unless the IR is designed to prevent programmers from corrupting the program representation. In other words, there must be a way of detecting the state of the IR is not as it should be, in order to report an error. To provide error checking and detect common errors by pass writers, the Cetus IR maintains several invariants. Violating one of the invariants below will probably result in an exception, to the extent that it is possible for Cetus to recognize what you have done.
Invariant
If an object has a parent, then it has exactly one parent.
Consequences
You may not take an object that has a parent and add it as the child of another object.
If you want to use the same object in more than one place in the syntax tree, you must clone the original object.
Clones are identical to the originals except their parent is null.
Invariant
An object P can enumerate all of its children.
Consequences
An iterator object can be created with P that can iterate over P's children. In some cases, the iterator will not visit data that is actually part of P itself. Nearly everything is kept in the list of children, and we may more move data into the list in the future.
Invariant
An object P controls the addition and removal of its children.
Consequences
An object C cannot become the child of an object P without P's permission.
Before an object C can set its parent reference to an object P, P must already recognize C is a child. I.e. C must already be in the list of P's children.
The child reference and parent reference (i.e. downlink and uplink) must be set in that order. The addXYZ methods of many classes will take care of this for you. There are also atomic swap methods that respect the ordering.
All of the classes are documented in detail with javadoc. That documentation can be found at http://www.ece.purdue.edu/ParaMount/Cetus/docs/.
A brief discussion of important base classes is below.
A TranslationUnit represents a single source file. It may only appear as the immediate child of the Program.
Declarations appear in many places. As children of TranslationUnit, they represent global declarations of variables, functions, or classes. As parameters of a Procedure, they represent declarations of formal parameters. As children of ClassDeclaration, they represent methods and data members.
Expressions are normally of most interest to optimization passes. All assignments, function calls, and mathematical computations are Expressions.
Statements serve two purposes: to provide control constructs, and to provide wrappers for Declarations and Expressions. Statements such as IfStatement provide a control construct, whereas DeclarationStatement and ExpressionStatement simply allow Declarations and Expressions to appear wherever a Statement may appear.