New AST

These pages are for exploring ideas and designing a new AST and/or "Intermediate Representation" tree for further semantic analysis of 4GL/ABL source code.

The output of Proparse and ProRefactor will be used to generate an IR tree which is more appropriate for control flow, data flow, and other kinds of analysis.

Proparse creates a syntax tree which is great for lint, code documentation, and search-and-replace tasks. It is also well suited to resolving symbols and determining scopes - which is exactly what ProRefactor does.

It is common in compilers to find the use of an Intermediate Representation (IR), which sits partway between parsing and generating target code (machine code, byte code, etc.). Before the target code is written, the IR undergoes a series of "semantics preserving" transformations, such as constant propagation, dead code removal, and other optimizations.
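To make "semantics preserving transformation" concrete, here is a minimal sketch of constant folding (a close relative of the constant propagation mentioned above) over a toy expression tree. It is written in Java, since Proparse and ProRefactor are Java-based, but every class and method name is invented for illustration - none of this is existing Proparse or ProRefactor code.

    // A toy IR: an expression is either an integer constant or a
    // binary operation. (Invented names; not Proparse/ProRefactor code.)
    abstract class Exp {}

    class Const extends Exp {
        final int value;
        Const(int value) { this.value = value; }
    }

    class BinOp extends Exp {
        final char op;              // '+', '-', or '*'
        final Exp left, right;
        BinOp(char op, Exp left, Exp right) {
            this.op = op; this.left = left; this.right = right;
        }
    }

    class ConstantFolder {
        // Semantics preserving: the result computes the same value
        // as the input, but constant subtrees are collapsed.
        static Exp fold(Exp e) {
            if (!(e instanceof BinOp)) return e;
            BinOp b = (BinOp) e;
            Exp l = fold(b.left);
            Exp r = fold(b.right);
            if (l instanceof Const && r instanceof Const) {
                int lv = ((Const) l).value;
                int rv = ((Const) r).value;
                switch (b.op) {
                    case '+': return new Const(lv + rv);
                    case '-': return new Const(lv - rv);
                    case '*': return new Const(lv * rv);
                }
            }
            return new BinOp(b.op, l, r);
        }
    }

Given the tree for (2 * 3) + x, fold returns the tree for 6 + x: the meaning is unchanged, but the tree is simpler. Dead code removal has the same flavor - a transformation that deletes statements whose results are provably never used.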

Many of these optimizations on the IR require control flow and data flow analysis to be performed. These are the things that we are interested in.

This topic belongs to the Codeparse group, and we will use its forum/mailing-list for discussions.


Links and Reference Materials

Appel's book Modern Compiler Implementation in Java (second edition) is a likely candidate as a guideline for our new IR. It also leads directly into some of the analysis techniques that we will want to get into, including the use of SSA form (see below). The MiniJava project source code, especially the "Tree" package, may be a useful starting point.

Static Single Assignment (SSA) form is an IR form that is ideal for the kinds of flow analysis that we want to perform.
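A tiny before-and-after illustration of the idea, using straight-line pseudocode. In SSA form every variable is assigned exactly once, so every use of a variable points back at exactly one definition - which is what makes def-use chains and other data flow questions so cheap to answer.

    // Before SSA:               // After SSA renaming:
    x = 1;                       x1 = 1;
    x = x + 2;                   x2 = x1 + 2;
    y = x;                       y1 = x2;

    // Where two control flow paths join (say, after an IF/ELSE where
    // one branch defined x3 and the other defined x4), SSA inserts a
    // "phi" function to merge the incoming definitions:
    //     x5 = phi(x3, x4);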

Parrot is a virtual machine project intended to become the virtual machine for Perl version 6, as well as a viable target for many other languages. Some of that project's documentation may be of interest to us, as they also have plans to provide an IR for compiler writers.

phc is a project to build a PHP compiler with native Linux binaries (x86) as its target. Although its target is of no interest to us, some of the parallels between that project and our own projects are interesting. Their list of spinoffs looks like our marketing pages for Proparse. :) Note, especially, that "semantic checker" is just another term for "lint". Their whatsinstore page dives right into IR and what they plan to do with it. They plan to design their own IR, which may or may not be of interest to us. See this post in their mailing list for their motivation.

The GNU Compiler Collection uses a tree SSA intermediate form for its compilers. This would be of a lot of interest to us if it weren't for the GPL. Anything we do here will have to be business friendly. Er, let me rephrase that. Anything *I* do here will be business friendly. Others, of course, are welcome to use these efforts here for playing with GCC as a target if they want. :)


Motivation

So why, exactly, do we want to do all this weird stuff?

The motivation comes from a surprisingly large number of goals, all of which depend on more thorough analysis of control flow and data flow in existing applications.

Prolint certainly comes to mind. There are many types of static code checking that could be done with more complete control and data flow analysis. For example, checking for problematic transaction scopes that span multiple compile units is difficult or impossible without a complete grasp of control flow.

Reverse engineering existing 4GL/ABL applications can hardly be considered complete without full data flow analysis. The flow of data through an application is key to how well that application can be understood, yet there just aren't any tools that gather this information in a complete and detailed fashion.

Finally, re-engineering projects cannot be automated to any significant extent without better semantic understanding. When an OpenEdge application needs that "complete rewrite", there aren't yet enough tools out there to make this task anything less than enormous. The existing application needs to be reverse-engineered, optimized and cleaned, re-architected, and then finally re-generated for the new OpenEdge framework (OO, OERA, etc.). Plenty of techniques already exist for optimizing and cleaning, but most of them depend on the existence of a suitable Intermediate Representation.


Statements and Functions

This section is speculative and subject to change.

Here, for discussion's sake, I'd like to describe the semantics of an OpenEdge application in terms of two broad categories.

The first is basic expressions and control flow. These are things that can be represented in any language, and include looping and conditional branching. These will be represented explicitly in our IR.
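As a strawman, the explicit category might boil down to a small set of node types, loosely in the spirit of the Tree package in Appel's MiniJava code. These Java classes are purely hypothetical - bare field sketches, with constructors omitted to keep it short:

    // Hypothetical IR node classes for the "explicit" category.
    abstract class IrExp {}                            // expressions
    abstract class IrStm {}                            // statements

    class Name    extends IrExp { String id; }         // variable reference
    class Literal extends IrExp { String value; }      // constant
    class Binary  extends IrExp { String op; IrExp left, right; }

    class Assign  extends IrStm { IrExp dest, src; }   // assignment
    class Label   extends IrStm { String name; }       // branch target
    class Jump    extends IrStm { Label target; }      // unconditional branch
    class CJump   extends IrStm { IrExp cond; Label ifTrue, ifFalse; }  // conditional branch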

The second is platform runtime support. Most statements and functions in 4GL/ABL are not simple expressions or control flow, but instead are direct references to OpenEdge platform runtime support. Those statements and functions will likely be represented in our IR by calls to stubs - that is, calls to unimplemented subroutines.
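Continuing the hypothetical node sketch above, the second category might need little more than one node type: a call to a named-but-unimplemented stub. For flow analysis, what matters is which values flow into (and out of) the call, not what the runtime actually does with them.

    import java.util.List;

    // Hypothetical: a 4GL/ABL statement or built-in function appears
    // in the IR as a call to an unimplemented runtime stub. The stub
    // name and the argument expressions are all the analysis sees.
    class RuntimeCall extends IrExp {
        String stubName;     // e.g. "SUBSTRING" or "MESSAGE" (illustrative)
        List<IrExp> args;    // values flowing into the runtime
    }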

To some extent, the two are intertwined. Consider the FOR EACH construct, which provides basic expression evaluation and looping, but also uses platform runtime support for buffer iteration over rows in tables. In cases such as these, our tree transform will have to tease the two apart, so that basic expressions and control flow are represented explicitly, and the use of platform runtime support is represented by calls to subroutines.
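For FOR EACH, the teased-apart result might look something like the pseudocode below. The rtOpenQuery/rtGetNext/rtCloseQuery stub names are invented for illustration; the point is only the shape - an explicit loop and conditional branch, with the buffer iteration pushed into runtime stub calls.

    // Source:
    //     FOR EACH customer WHERE customer.balance > 0:
    //         /* body */
    //     END.
    //
    // One possible lowering (pseudocode):
    //     q = rtOpenQuery(customer, balance > 0)    // runtime support
    // top:
    //     if (!rtGetNext(q)) goto done              // runtime support
    //     /* body: explicit expressions and control flow */
    //     goto top                                  // explicit control flow
    // done:
    //     rtCloseQuery(q)                           // runtime support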

Statements in 4GL/ABL can be enormous. In OpenEdge itself, the statement is compiled to an AST, and that AST is interpreted by the runtime. If, in our IR, we treat most statements as calls to subroutines, it is tricky to come up with an appropriate representation for passing all of the data contained in the statement's AST branch to the subroutine. There needs to be a happy medium between implementing the semantics of the subroutine itself, which we don't want to do, and just passing the entire AST as an argument to the subroutine stub, which would defeat our goal of using the IR for data flow analysis. Update: I'm pretty sure I've come up with a good way to handle this. More to come soon!