<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<article id="ANTLR">
  <articleinfo>
    <title>ANTLR</title>
    <author>
      <firstname>Ashley</firstname>
      <othername>J.S</othername>
      <surname>Mills</surname>
      <affiliation>
 <address><email>ashley@ashleymills.com</email></address>
      </affiliation>
    </author>

    <copyright>
      <year>2005</year> 
      <holder role="mailto:ashley@ashleymills.com">The University Of Birmingham</holder>
    </copyright>
  </articleinfo>

  <sect1 id="ANTLR-Introduction"><title>Introduction</title>
    <para>
      ANTLR (ANother Tool for Language Recognition) is a parser and translator generator that lets one define language grammars in either ANTLR syntax (which resembles YACC and EBNF (Extended Backus-Naur Form)) or a special AST (Abstract Syntax Tree) syntax.  ANTLR can create lexers, parsers and ASTs.  ANTLR is more than just a grammar definition language, however: the tools provided allow one to implement the ANTLR-defined grammar by automatically generating lexers and parsers (and tree parsers) in either Java (<ulink url="http://java.sun.com/">http://java.sun.com/</ulink>), C++ (<ulink url="http://anubis.dkuug.dk/jtc1/sc22/wg21/">http://anubis.dkuug.dk/jtc1/sc22/wg21/</ulink>) or Sather (<ulink url="http://www.icsi.berkeley.edu/~sather/">http://www.icsi.berkeley.edu/~sather/</ulink>).
    </para>

    <para>
      ANTLR implements a PRED-LL(k) parsing strategy and affords arbitrary lookahead for resolving ambiguities. An answer to the question &quot;What is ANTLR?&quot; by Terence Parr, the creator of ANTLR, can be found here: <ulink url="http://www.jguru.com/faq/view.jsp?EID=77">http://www.jguru.com/faq/view.jsp?EID=77</ulink>.
    </para>
  </sect1>

  <sect1 id="ANTLR-Background-Information"><title>Background Information</title>
    <para>
      ANTLR is a compiler tool, hence its developer base is generally limited to those who wish to create translators of some kind. To comprehend much of what is discussed in this tutorial, it is necessary first to get a feel for the terminology used in this area of computer science and the basic concepts behind the operation of ANTLR. This section begins with a brief discussion of how a compiler operates.
    </para>

    <sect2 id="ANTLR-Background-Information-Lexer"><title>The Lexer</title>

      <para><emphasis role="strong">Other names: Scanner, lexical analyser, tokenizer.</emphasis></para>

      <para>
        Programming languages are made up of keywords and strictly defined constructs. The ultimate aim of the compilation process is to translate the high-level instructions of the programming language into the low-level instructions of the machine or virtual machine that is the intended execution architecture.  For example, a native C++ compiler compiles C++ code into machine language instructions that execute directly on the target hardware (or on some simulation of the target hardware). The standard Java compiler distributed by Sun Microsystems compiles Java source code to Java bytecode, which is the machine language instruction set used by the Java virtual machine; this bytecode can then be executed on any platform that implements the Java virtual machine.
      </para>

      <para>
        A source program is written using some kind of editing tool that can produce a file which is comprised of statements and constructs that are allowed in the programming language being used.  The actual text of the file is written using characters of a particular character set or subset of some character set, so a source file can be thought of as a stream of characters terminated by some EOF (End Of File) marker that signifies the end of the source file.
      </para>

      <para>
        A source file is streamed to a lexer on a character-by-character basis by some kind of input interface.  The lexer's job is to group the otherwise meaningless stream of characters into discrete units that, when processed by the parser, have meaning. Each character or group of characters identified in this manner is called a token. Tokens are components of the programming language in question such as keywords, identifiers, symbols, and operators.  The lexer usually removes comments and whitespace from the program, along with any other content that has no semantic value to the interpretation of the program.  The lexer converts the stream of characters into a stream of tokens which have individual meaning as dictated by the lexer's rules. Similarly, your brain is probably grouping the individual characters that make up each of the words in this sentence into tokens (words in this case, which have semantic value to you). Your job of determining where one token finishes and another begins is made a little easier because the words in a sentence are already separated by spaces. It could be argued that an English sentence is already tokenised in this sense; however, we can assume that some kind of grouping and recognition is occurring at the word level too. The stream of tokens generated by the lexer is received by the parser.
      </para>

      <para>
        A lexer usually generates errors pertaining to sequences of characters it cannot match to a specific token type defined by one of its rules.
      </para>
    </sect2>

    <sect2 id="ANTLR-Background-Information-Parser"><title>The Parser</title>

      <para><emphasis role="strong">Other names: Syntactical analyser.</emphasis></para>

      <para>
        A lexer groups the sequences of characters it recognises in the character stream into tokens with individual semantic worth. It does not consider their semantic worth within the context of the whole program; this is the job of the parser.  Languages are described by a grammar, which determines exactly what defines a particular token and which sequences of tokens are valid. The parser organises the tokens it receives into the allowed sequences defined by the grammar of the language. If the language is being used exactly as defined in the grammar, the parser will be able to recognise the patterns that make up certain structures and group these together. If the parser encounters a sequence of tokens that matches none of the allowed sequences, it will issue an error and perhaps try to recover from the error by making a few assumptions about what the error was.
      </para>

      <para>
        The parser checks whether the tokens conform to the syntax of the language defined by the grammar. Similarly, your brain knows what kinds of sentences are valid within a particular language such as English, and it could be said that at this moment your brain is parsing the words of this sentence and grouping them into what you understand as valid sequences. For instance, your brain knows that a sentence ends when a full stop is encountered; one would not assume that the text following the full stop was part of the same sentence. In addition to this, your brain is also extracting meaningful information from the sentence.  Usually the parser will convert the sequences of tokens that it has been created to match into some other form such as an Abstract Syntax Tree (AST). An AST is easier to translate to a target language because it carries additional information implicitly, by the nature of its structure. Effectively, creating an AST is the most important part of a language translation process.
      </para>
        
      <para>
        The parser generates one or more symbol tables which contain information regarding the tokens it encounters, such as whether or not a token is the name of a procedure or whether it has some specific value. The symbol tables are used in the generation of object code and in type checking, for example so that an integer cannot be assigned to a string. ANTLR uses symbol tables to speed up the matching of tokens: an integer is mapped to each token type, and instead of matching the string that would compose a textual description of that token, the integer that represents it is matched, which is a lot quicker. Eventually the AST will be translated to an executable format and some linking of libraries may be performed; this is not considered the job of the compiler and is not of direct concern here.
      </para>

      <para>
        A parser usually generates errors pertaining to sequences of tokens it cannot match to the specific syntactical arrangements allowed, as decreed by the grammar.
      </para>

      <para>
        Both lexers and parsers are recognizers: lexers recognize sequences of characters, and parsers recognize sequences of tokens.  A lexer or a parser takes a stream of elements (be they characters or tokens) and translates it into some other stream of elements, such as tokens representing larger structures or groups of elements, or perhaps nodes in an abstract syntax tree. They are essentially the same thing; traditionally, however, lexers are associated with processing streams of characters and parsers with processing streams of tokens.
      </para>

      <para>
        It is recommended that you read <citetitle>Building Recognizers By Hand</citetitle> by Terence Parr, the creator of ANTLR, which can be found here: <ulink url="http://www.antlr.org/book/byhand.pdf">http://www.antlr.org/book/byhand.pdf</ulink>. It gives an insight into how one would go about creating a recogniser in Java, and from this you can infer how it can be done in any programming language. When you have the time, you should read all the documentation that came with the ANTLR installation.
      </para>
    </sect2>

    <sect2 id="ANTLR-Background-Information-AndANTLR"><title>What is ANTLR's part in all this?</title>
      <para>
        ANTLR lets you define the rules that the lexer should use to tokenize a stream of characters and the rules the parser should use to interpret a stream of tokens. ANTLR can then generate a lexer and a parser which you can use to interpret programs written in your language and translate them into other languages and ASTs.  The design of ANTLR affords much extensibility and it has many applications.
      </para>
    </sect2>
  </sect1>

  <sect1 id="ANTLR-Installation"><title>Installation</title>
    <para>
      The documentation for the installation is written under the assumption that the reader has some experience of installing software on computers and knows how to change the operating environment of the particular operating system they are using. The documents entitled <ulink url="../winenvars/winenvarshome.html"><citetitle>Configuring A Windows Working Environment</citetitle></ulink> and <ulink url="../unixenvars/unixenvarshome.html"><citetitle>Configuring A Unix Working Environment</citetitle></ulink> are of use to people who need to know more.
    </para>

    <orderedlist>
      <listitem>
        <para>
          Obtain the ANTLR download by following the download section links at <ulink url="http://www.antlr.org/">http://www.antlr.org/</ulink>
        </para>
      </listitem>
      <listitem>
        <para>
          Unzip to a suitable location.
        </para>
      </listitem>
      <listitem>
        <para>
          Add <filename>/path/to/where/you/unzipped/ANTLR/antlr.jar</filename> and <filename>/path/to/where/you/unzipped/ANTLR</filename> to your classpath. Do not include a trailing slash after the directory name, otherwise you may encounter problems.
        </para>
      </listitem>
    </orderedlist>
  </sect1>

  <sect1 id="ANTLR-Grammar-Template"><title>ANTLR Grammar Template</title>
    <para>
      An ANTLR grammar file has a number of components, some of which are optional and some of which are mandatory. The 'template' below shows all the components that make up an ANTLR grammar file, and the callouts that follow briefly describe them. Do not expect to comprehend this instantly; things will become clearer the further you progress into this document. You may then use this template to remind yourself what is allowed where within an ANTLR grammar.
    </para>

    <programlisting>
header {
  // stuff that is placed at the top of &lt;all&gt; generated files
}        <co id="ANTLR-Grammar-Template-Header"/>

options { options for entire grammar file }

{ optional class preamble - output to generated classfile
immediately before the definition of the class }
class YourLexerClass extends Lexer; 
// definition extends from here to next class definition 
// (or EOF if no more class defs)
options { YourOptions }
tokens...<co id="ANTLR-Grammar-Template-Lexer"/>
lexer rules...
myrule[args] returns [retval]
   options { defaultErrorHandler=false; }
   :   // body of rule...
   ;    

{ optional class preamble - output to generated classfile
immediately before the definition of the class }
class YourParserClass extends Parser;
options { YourOptions }
tokens...
parser rules...   

{ optional class preamble - output to generated classfile
immediately before the definition of the class }
class YourTreeParserClass extends TreeParser;
options { YourOptions }
tokens...
tree parser rules...  

// arbitrary lexers, parsers and treeparsers may be included
    </programlisting>

    <calloutlist>
      <callout arearefs="ANTLR-Grammar-Template-Header">
        <para/>
        <programlisting>
header {
  // stuff that is placed at the top of &lt;all&gt; generated files
}          
        </programlisting>

        <para>
          The textual content of the header will be copied verbatim to the top of all files generated when ANTLR is run on the grammar.
        </para>
      </callout>

      <callout arearefs="ANTLR-Grammar-Template-Lexer">
        <para/>
        <programlisting>
{ optional class preamble - output to generated classfile 
          immediately before the definition of the class }
class YourLexerClass extends Lexer; 
// definition extends from here to next class definition 
// (or EOF if no more class defs)
options { YourOptions }
tokens...
lexer rules...     
        </programlisting>

        <para>
          This begins with an optional class preamble; any text placed here will be copied verbatim into the generated file immediately before the definition of the class it prefixes. The options section contains options specific to this class, for example:
        </para>

        <programlisting>
options {
   k = 2;                       // Set token lookahead to two
   tokenVocabulary = Befunge;   // Call its vocabulary &quot;Befunge&quot;
   defaultErrorHandler = false; // Don't generate parser error handlers
}
        </programlisting>

        <para>
          Extensive detail pertaining to the various options available can be found in the ANTLR documentation that is installed with everything else, i.e. <filename>/path/to/where/you/installed/ANTLR/doc/options</filename>. The tokens section lets you explicitly define literals and imaginary tokens, e.g.:
        </para>

        <programlisting>
tokens {
   EXPR;          // Imaginary token
   THIS=&quot;that&quot;;   // Literal definition
   INT=&quot;int&quot;;     // Literal definition
}
        </programlisting>

        <para>The lexer rules come next and have the general form:</para>

        <programlisting>
rulename [args] returns [retval]
   options { defaultErrorHandler=false; }
   { optional initiation code }
   :   alternative_1
   |   alternative_2
   ...
   |   alternative_n
   ;    
        </programlisting>

        <para>For example:</para>

        <programlisting>
INT    : ('0'..'9')+; // Matches an integer
        </programlisting>

        <para>
          There can be an arbitrary number of rules.
        </para>
      </callout>
    </calloutlist>

    <para>
      The other two sections have the same layout as the one described. There can be zero or more lexers, parsers and tree parsers, and they can come in any order. The scope of a class extends from the class's declaration to the next class declaration or, if it is the last class declaration, to the end of the file.
    </para>
  </sect1>

  <sect1 id="ANTLR-Notation"><title>ANTLR Notation</title>
    <para>
      ANTLR specifies its lexical rules and parser rules using almost exactly the same notation. ANTLR notation is based on YACC's notation, with some EBNF constructs thrown in for good measure.  A rule is simply a sequence of instructions which describe a particular pattern that ANTLR should match.
    </para>

    <sect2 id="ANTLR-Notation-ZOM"><title>Zero Or More</title>
      <para>
        ANTLR uses the notation <emphasis role="strong">(expression)*</emphasis> to indicate that the pattern matching expression specified inside the parentheses must be matched zero or more times.
      </para>
    </sect2>

    <sect2 id="ANTLR-Notation-OOM"><title>One Or More</title>
      <para>
        ANTLR uses the notation <emphasis role="strong">(expression)+</emphasis> to indicate that the pattern matching expression specified inside the parentheses must be matched one or more times.
      </para>
    </sect2>

    <sect2 id="ANTLR-Notation-OPT"><title>Optional</title>
      <para>
        ANTLR uses the notation <emphasis role="strong">(expression)?</emphasis> to indicate that the pattern-matching expression specified inside the parentheses may be matched zero or one times; in other words, it is optional.
      </para>
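
      <para>
        As a small illustration, the two lexer rules below combine the optional operator with the zero-or-more and one-or-more operators. This is only a sketch: the class name <emphasis role="strong">NotationLexer</emphasis> and the rule names are hypothetical and are not used elsewhere in this document.
      </para>

      <programlisting>
class NotationLexer extends Lexer;

// A lower-case letter followed by zero or more lower-case letters or digits
ID  : ('a'..'z') ('a'..'z'|'0'..'9')* ;

// An optional minus sign followed by one or more digits
INT : ('-')? ('0'..'9')+ ;
      </programlisting>

      <para>
        With these rules, <emphasis role="strong">x</emphasis> and <emphasis role="strong">x42</emphasis> would both be matched by ID, while <emphasis role="strong">7</emphasis> and <emphasis role="strong">-15</emphasis> would both be matched by INT.
      </para>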
    </sect2>
  </sect1>

  <sect1 id="ANTLR-LexerExample"><title>Lexer Example</title>
    <para>
      This example illustrates a very simple lexer that matches alpha and numeric strings. It is available to download here: <ulink url="files/simple.g">simple.g</ulink>.
    </para>

    <programlisting>
class SimpleLexer extends Lexer;

options { k=1; filter=true; }

ALPHA   : ('a'..'z'|'A'..'Z')+
        { System.out.println(&quot;Found alpha: &quot;+getText()); }
        ;

NUMERIC : ('0'..'9')+
        { System.out.println(&quot;Found numeric: &quot;+getText()); }
        ;

EXIT    : '.' { System.exit(0); } ;      
    </programlisting>

    <para>This will be explained in sections; let's begin with the first line:</para>

    <programlisting>
class SimpleLexer extends Lexer;      
    </programlisting>

    <para>
      This is the lexer declaration. Its scope extends from the line shown to the next class declaration or, if there are no more class declarations, to the end of the file.
    </para>

    <programlisting>
options { k=1; filter=true; }
    </programlisting>

    <para>
      Here some basic options are set. <emphasis role="strong">k</emphasis>, the lookahead value, is set to one.  For example, with a lookahead of one, ANTLR would not be able to tell the difference between:
    </para>

    <programlisting>
      SILLY1 : &quot;ab&quot; ;
      SILLY2 : &quot;ac&quot; ;
    </programlisting>

    <para>
      When processing a grammar file containing these lexer rules, ANTLR will issue the warning:
    </para>

    <screen>
warning: lexical nondeterminism between rules SILLY1 and SILLY2 upon
silly.g:0:  k==1:'a'
    </screen>

    <para>
      This is because, if the lexer encountered &quot;ab&quot; with a lookahead of one, it could not tell whether it should match the rule SILLY1 or the rule SILLY2, since they both begin with 'a'. With k=2, that is, a lookahead of two, the lexer will compare not only the first character but also the second, and hence will be able to distinguish between the two cases.  There are cases when increasing the lookahead does not work or is not efficient; these cases will be discussed in another section. Interestingly, if you actually implement this example with k=1, it will match &quot;ab&quot; but not &quot;ac&quot;, because ANTLR matches whichever of the ambiguous rules is defined first.
    </para>
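
    <para>
      The two rules can be made unambiguous by raising the lookahead as just described. A minimal sketch follows; the class name <emphasis role="strong">SillyLexer</emphasis> is hypothetical:
    </para>

    <programlisting>
class SillyLexer extends Lexer;

options { k=2; }   // examine two characters before choosing a rule

SILLY1 : &quot;ab&quot; ;   // predicted by 'a' followed by 'b'
SILLY2 : &quot;ac&quot; ;   // predicted by 'a' followed by 'c'
    </programlisting>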

    <para>
      The <emphasis role="strong">filter=true</emphasis> option sets filtering on, which means that the lexer will ignore all input that does not exactly match one of the non-protected rules (protected rules can only be called by other rules). It can also be set to a protected rule name to assign handling of non-matches to a particular rule, for example:
    </para>

    <programlisting>
options { filter=BLAH; }
protected BLAH : { _ttype = Token.SKIP; } ;
    </programlisting>

    <para>
      This has the same effect as setting filter to true. However, using the filter in this way is <emphasis>not</emphasis> recommended, as it would be entirely redundant.
    </para>
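
    <para>
      Apart from filtering, protected rules are typically used as helper rules that other rules call. A minimal sketch, in which the rule names <emphasis role="strong">DIGIT</emphasis> and <emphasis role="strong">INT</emphasis> are chosen purely for illustration:
    </para>

    <programlisting>
// DIGIT never produces a token of its own; it can only be called by other rules
protected
DIGIT : '0'..'9' ;

// INT calls DIGIT one or more times and is returned as a single token
INT   : (DIGIT)+ ;
    </programlisting>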

    <programlisting>
ALPHA   : ('a'..'z'|'A'..'Z')+
        { System.out.println(&quot;Found alpha: &quot;+getText()); }
        ;
    </programlisting>

    <para>
      This is an example of a lexer rule. It matches any sequence of one or more characters within the ranges specified.  The '+' means match one or more of whatever was specified in the group preceding it, in this case anything within the range of characters <emphasis role="strong">'a'..'z'</emphasis> or <emphasis role="strong">'A'..'Z'</emphasis>; the ranges are separated by a '|' character, which signifies a logical OR.  An action has also been specified for this rule, which will only be executed upon a successful match. The action is simply a print statement indicating that an alpha token has been found, followed by the token text, which is obtained with a call to <emphasis role="strong">getText()</emphasis>.
    </para>

    <programlisting>
NUMERIC : ('0'..'9')+
        { System.out.println(&quot;Found numeric: &quot;+getText()); }
        ;      
    </programlisting>

    <para>
      This is very similar to the <emphasis role="strong">ALPHA</emphasis> rule but instead matches sequences of one or more characters within the range <emphasis role="strong">'0'..'9'</emphasis>. If a match occurs, it prints a message indicating that it has found a numeric token and then prints the token's text.
    </para>

    <programlisting>
EXIT    : '.' { System.exit(0); } ;
    </programlisting>

    <para>
      This matches the literal character '.' (when unquoted, '.' can be used within rules as a wild-card that will &quot;match any character&quot; encountered).  The action performed when it is matched is to exit the program with an exit status of zero, signifying that the program exited normally.
    </para>

    <note>
      <para>
        When using the <emphasis role="strong">filter=true</emphasis> option, if one is processing actual files then one should always include a rule to match newlines, because the lexer needs to be told that a newline has occurred in order to increment the line count, otherwise the lexer would be stuck on one line!  A typical newline rule, showing the call to the <emphasis role="strong">newline()</emphasis> method to handle the newlines correctly is shown below:
      </para>

      <programlisting>
NEWLINE   :  ( &quot;\r\n&quot; // DOS
               | '\r'   // MAC
               | '\n'   // Unix
             )
             { newline(); 
               $setType(Token.SKIP);
             }
          ;
      </programlisting>

      <para>
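        The ANTLR directive <emphasis role="strong">$setType(tokenType)</emphasis> sets the type of the token being built; setting it to <emphasis role="strong">Token.SKIP</emphasis>, as here, indicates that these sequences of characters should be ignored.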
      </para>
    </note>

    <para>After the lexer has been defined, ANTLR is run on the grammar file to generate the various Java (or C++ or Sather) files:</para>

    <screen><userinput><command>java</command> antlr.Tool <filename>simple.g</filename></userinput></screen>

    <para>
      The command displays the version information and then, if no errors are present in the source file, generates the output files.  If errors are present in the source file and are detected, messages are displayed describing them.  If the errors are not fatal, the output files will usually still be produced, although they may not compile.  The text produced when errors are encountered can be quite informative, such as this message, generated when I deliberately omitted the rule-terminating ';' character from the end of a rule:
    </para>

    <screen>
ANTLR Parser Generator   Version 2.7.1   1989-2000 jGuru.com
simple.g:13: warning:did you forget to terminate previous rule?
    </screen>

    <para>
      The line number indicated is the start of the rule that follows the one I forgot to terminate. In this case, ANTLR recovered from this error and the output files still worked as desired.  The output files produced are <filename>SimpleLexer.java</filename> and <filename>SimpleLexerTokenTypes.java</filename>.
    </para>

    <para>
      <filename>SimpleLexer.java</filename> is the lexer produced. It implements TokenStream, which means that it will return the next token in the token stream when somebody calls <emphasis role="strong">nextToken()</emphasis> on an instance of SimpleLexer (or whatever the lexer's name is). It contains other methods which are discussed in more detail in the ANTLR documentation.
    </para>

    <para>
      <filename>SimpleLexerTokenTypes.java</filename> contains the token type definitions, defined as integer constants for efficient comparison at runtime. Token tables are also used for type checking before translation. In order to use the lexer, some kind of program must invoke it; here is a Java program which does just that:
    </para>

    <programlisting>
import antlr.Token;

public class Main {
   public static void main(String[] args) {
      SimpleLexer simpleLexer = new SimpleLexer(System.in);
      try {
         // The rule actions do the printing, so each token is simply
         // requested and discarded until the end of the input is reached
         while (simpleLexer.nextToken().getType() != Token.EOF_TYPE) {
            // nothing to do
         }
      } catch (Exception e) {
         // Report lexer and I/O errors instead of silently ignoring them
         System.err.println(e);
      }
   }
}
    </programlisting>

    <para>
      A new instance of <emphasis role="strong">SimpleLexer</emphasis> is created using the <emphasis>SimpleLexer(InputStream in)</emphasis> constructor that is automatically generated. A loop is then entered which keeps grabbing the next token from the input stream until the end of the input is reached or the <emphasis role="strong">EXIT</emphasis> rule terminates the program.  The input stream is <emphasis role="strong">System.in</emphasis>, which means that input is typed on the command line and passed to the lexer every time the enter key is hit (hence no need for newline handling).
    </para>

    <para>All of the Java files are compiled by issuing:</para>
    
    <screen><userinput><command>javac</command> <filename>*.java</filename></userinput></screen>
    
    <para>The <filename>Main.class</filename> produced is executed by issuing:</para>
    
    <screen><userinput><command>java</command> <filename>Main</filename></userinput></screen>
    
    <para>An example session using this Lexer is shown below:</para>

    <screen>
<userinput>This Lexer recognises strings and numbers: hello 22 goodbye 33</userinput>
Found alpha: This
Found alpha: Lexer
Found alpha: recognises
Found alpha: strings
Found alpha: and
Found alpha: numbers
Found alpha: hello
Found numeric: 22
Found alpha: goodbye
Found numeric: 33
<userinput>It ignores everything else: -=+/#</userinput>
Found alpha: It
Found alpha: ignores
Found alpha: everything
Found alpha: else
<userinput>.</userinput>
    </screen>

    <para>
      This lexer exclusively recognises alpha and numeric content. If it is passed the string &quot;11aa33hi&quot;, it will not treat it as a single string but will break up the alpha and numeric parts, as it was specified to do:
    </para>

    <screen>
<userinput>11aa33hi</userinput>
Found numeric: 11
Found alpha: aa
Found numeric: 33
Found alpha: hi
<userinput>.</userinput>
    </screen>

    <para>
      That about wraps up this lexer introduction, but it should be noted that a lexer is usually used in combination with a parser; this example is contrived to illustrate some of the concepts. You should consult the ANTLR documentation and check out the other examples in this text for a more thorough comprehension of the lexer's part in the translation process.
    </para>
  </sect1>

  <sect1 id="ANTLR-Simple-Example"><title>Simple Lexer/Parser Example</title>
    <para>
      A simple lexer and parser will now be discussed.  The job of the lexer will be to tokenise the input stream into the tokens <emphasis role="strong">NAME</emphasis>, <emphasis role="strong">AGE</emphasis>, <emphasis role="strong">DOB</emphasis> (Date Of Birth) and <emphasis role="strong">SEMI</emphasis> (semicolon). The parser's job will be to recognise the sequence <emphasis role="strong">DOB NAME AGE(SEMI)</emphasis> and output this as <emphasis role="strong">Name: name, Age: nn, DOB: nn/nn/nn</emphasis>. The lexer and parser will be defined in one file, <ulink url="files/simple2.g">simple2.g</ulink>:
    </para>

    <programlisting>
class SimpleParser extends Parser;

entry : (d:DOB n:NAME a:AGE(SEMI)
      { 
        System.out.println(
          &quot;Name: &quot;    + 
          n.getText() +
          &quot;, Age: &quot;   +
          a.getText() + 
          &quot;, DOB: &quot;   +
          d.getText()
        );
      })*
      ;

class SimpleLexer extends Lexer;

NAME : ('a'..'z'|'A'..'Z')+;

DOB  : ('0'..'9' '0'..'9' '/')=&gt; 
       (('0'..'9')('0'..'9')'/')(('0'..'9')('0'..'9')'/')('0'..'9')('0'..'9')
     | ('0'..'9')+      { $setType(AGE); } ;

WS     :
    (' ' 
    | '\t' 
    | '\r' '\n' { newline(); } 
    | '\n'      { newline(); }
    ) 
    { $setType(Token.SKIP); } 
  ;

SEMI : ';' ;
    </programlisting>

    <para>
      The parser is the first class to be specified; however, order does not matter. It is considered better practice by some to start at the most abstract level possible and then work toward the bottom, i.e. top down.
    </para>

    <programlisting>
entry : (d:DOB n:NAME a:AGE(SEMI)
      { 
        System.out.println(
          &quot;Name: &quot;    + 
          n.getText() +
          &quot;, Age: &quot;   +
          a.getText() + 
          &quot;, DOB: &quot;   +
          d.getText()
        );
      })*
      ;      
    </programlisting>

    <para>
      You can see that the first rule is <emphasis>entry</emphasis>; this is the rule which will be called from the <emphasis>main</emphasis> method to start parsing the input.  It says that the parser should look for a <emphasis>DOB</emphasis> token followed by a <emphasis>NAME</emphasis> token, followed by an <emphasis>AGE</emphasis> token, and terminated with a <emphasis>SEMI</emphasis> token, which is a semicolon. The spaces that separate these items in the input are discarded by the lexer's <emphasis>WS</emphasis> rule, so the parser never sees them. (The reason for <emphasis>SEMI</emphasis> being in brackets is so that it can be placed immediately after <emphasis>AGE</emphasis> without ANTLR thinking we are trying to reference some token called <emphasis>AGESEMI</emphasis>.) In other words, it is looking for:
    </para>

    <programlisting>
DOB NAME AGE; 
    </programlisting>

    <para>
      If it is successful in finding this sequence of tokens, the variables indicated immediately before the token names (<emphasis role="strong">a</emphasis> in <emphasis role="strong">a:AGE</emphasis> etc.) will refer to the tokens they label, and then an action will be performed; the action is indicated by the opening brace. The intended action is to print &quot;Name: name, Age: nn, DOB: nn/nn/nn&quot; where &quot;n&quot; signifies some digit.  Notice the scope of the opening parenthesis: its matching parenthesis occurs immediately after the closing brace of the action section.  It is postfixed with a '*', meaning that this sequence should be matched zero or more times. Let's take a look at the definitions of these tokens in the lexer:
    </para>

    <programlisting>
NAME : ('a'..'z'|'A'..'Z')+;
    </programlisting>

    <para>A simple sequence of one or more lower-case <emphasis role="strong">OR</emphasis> upper-case letters.</para>

    <programlisting>
DOB  : ('0'..'9' '0'..'9' '/')=&gt; 
       (('0'..'9')('0'..'9')'/')(('0'..'9')('0'..'9')'/')('0'..'9')('0'..'9')
     | ('0'..'9')+      { $setType(AGE); } ;
   </programlisting>

   <para>If <emphasis>DOB</emphasis> and <emphasis>AGE</emphasis> had been specified like this:</para>

   <programlisting>
DOB : ('0'..'9')('0'..'9')'/' ('0'..'9')('0'..'9')'/' ('0'..'9')('0'..'9') ;

AGE : ('0'..'9')+ ;
   </programlisting>

   <para>ANTLR would have issued the warning:</para>

   <screen>
warning: lexical nondeterminism between rules DOB and AGE upon
simple2.g:0:  k==1:'0'..'9'
   </screen>

   <para>
     This is because both rules begin with ('0'..'9'). With a lookahead of one, a lexer that encounters a digit cannot tell which of the two rules to try, so it would always try the first one; <emphasis>AGE</emphasis> would never be checked and the parser could never find the sequence it was looking for.
   </para>

   <para>
     One way around this would be to specify a lexer lookahead of 3, that is, to include an options statement in the lexer such as:
   </para>

   <programlisting>
options { k=3; }
   </programlisting>

   <para>
     This would allow the lexer to look ahead up to three characters where necessary to disambiguate. If it saw two digits followed by a forward slash it would predict the <emphasis>DOB</emphasis> rule, and if the third character were not a forward slash it would choose the <emphasis>AGE</emphasis> rule.
   </para>

   <para>
     If many alternatives can be distinguished by increasing the lookahead, then raising the lookahead value is the preferable method. If there are only one or a few such alternatives, a syntactic predicate is preferable, and that is what the example uses. The syntactic predicate in the example is:
   </para>

   <programlisting>
DOB  : ('0'..'9' '0'..'9' '/')=&gt;
   </programlisting>

   <para>
     This says that the lexer should check whether it can find two digits followed by a forward slash. If it can, the lexer goes on to try to match the sequence specified:
   </para>

   <programlisting>
(('0'..'9')('0'..'9')'/')(('0'..'9')('0'..'9')'/')('0'..'9')('0'..'9')
   </programlisting>

   <para>
     If it is successful in doing so then the token will be matched as <emphasis>DOB</emphasis>, the original rule.  If the syntactic predicate fails, that is, the input does not start with two digits followed by a forward slash, the lexer tries to match the second alternative, the one after the '|':
   </para>

   <programlisting>
| ('0'..'9')+      { $setType(AGE); } ;     
   </programlisting>

   <para>
     If this alternative is matched, the action after it is executed. The action is a call to the <emphasis role="strong">$setType(type)</emphasis> ANTLR directive, which changes the type of the token from <emphasis>DOB</emphasis> to <emphasis>AGE</emphasis>; note that it is not a Java statement but an ANTLR one, hence the '$' prefix. This shows how a syntactic predicate can disambiguate between two rules that begin with the same characters.
   </para>
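
   <para>
     The decision that the predicate makes can be sketched in plain Java. This is only an illustration and not the code that ANTLR generates: the class <emphasis>DobOrAgeSketch</emphasis>, the enum <emphasis>Kind</emphasis> and the methods <emphasis>digitAt</emphasis> and <emphasis>predict</emphasis> are names invented for this sketch. It peeks at up to three characters without consuming any of them and then commits to one alternative:
   </para>

   <programlisting><![CDATA[
public class DobOrAgeSketch {

    /** Hypothetical token kinds, named after the two lexer rules. */
    enum Kind { DOB, AGE }

    /** True when index i is inside s and holds one of '0'..'9'. */
    static boolean digitAt(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
    }

    /**
     * Mirrors the predicate ('0'..'9' '0'..'9' '/'): look at three
     * characters starting at pos, consume nothing, and pick an alternative.
     */
    static Kind predict(String s, int pos) {
        boolean slashThird = pos + 2 < s.length() && s.charAt(pos + 2) == '/';
        if (digitAt(s, pos) && digitAt(s, pos + 1) && slashThird) {
            return Kind.DOB;   // predicate succeeded: take the first alternative
        }
        if (digitAt(s, pos)) {
            return Kind.AGE;   // predicate failed: fall through to ('0'..'9')+
        }
        throw new IllegalArgumentException("no digit at offset " + pos);
    }

    public static void main(String[] args) {
        System.out.println(predict("06/06/82", 0)); // DOB
        System.out.println(predict("20;", 0));      // AGE
        System.out.println(predict("205", 0));      // AGE
    }
}
]]></programlisting>

   <para>
     The third call shows why the slash matters: three digits in a row start with the same two characters as a date, and only the third character tells the two apart.
   </para>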

   <programlisting>
WS     :
    (' ' 
    | '\t' 
    | '\r' '\n' { newline(); } 
    | '\n'      { newline(); }
    ) 
    { $setType(Token.SKIP); } 
  ;
   </programlisting>

   <para>
     The <emphasis role="strong">WS</emphasis> rule matches white space, hence &quot;WS&quot;. The rule will match a space (' ') <emphasis role="strong">OR</emphasis> a tab ('\t') <emphasis role="strong">OR</emphasis> a DOS newline, which is a carriage return followed by a newline ('\r' '\n'), <emphasis role="strong">OR</emphasis> a Unix newline, which is a single newline character ('\n'). The DOS and Unix newline alternatives have actions associated with them: calls to the <emphasis role="strong">newline()</emphasis> method, which tells ANTLR to increment its line count and go to the next line.  Without this the lexer would be stuck on one line!  All the alternatives are grouped together within parentheses and have an overall action associated with them, which sets the token type to the ANTLR special type <emphasis>Token.SKIP</emphasis>, causing the tokens to be ignored.
   </para>
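
   <para>
     To see why the newline actions matter, here is a small stand-alone Java sketch of the same bookkeeping. It is not ANTLR code: <emphasis>WhitespaceSketch</emphasis>, <emphasis>chunksWithLines</emphasis> and <emphasis>isSkippable</emphasis> are names made up for the illustration, and it splits the input into whitespace-separated chunks rather than real tokens. Each chunk is tagged with the line on which it was found:
   </para>

   <programlisting><![CDATA[
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSketch {

    static boolean isSkippable(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Splits text into whitespace-separated chunks, each prefixed with "line:". */
    static List<String> chunksWithLines(String text) {
        List<String> out = new ArrayList<>();
        int line = 1;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == ' ' || c == '\t') {
                i++;                       // skipped, like Token.SKIP
            } else if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                line++;                    // DOS newline: two characters, one line
                i += 2;
            } else if (c == '\n') {
                line++;                    // Unix newline
                i++;
            } else if (c == '\r') {
                // The WS rule has no alternative for a lone carriage return either.
                throw new IllegalArgumentException("lone carriage return on line " + line);
            } else {
                int start = i;
                while (i < text.length() && !isSkippable(text.charAt(i))) {
                    i++;
                }
                out.add(line + ":" + text.substring(start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chunksWithLines("06/06/82 Peter 20;\r\n03/04/83\tRosie 19;\n04/05/81 Mikey 21;"));
    }
}
]]></programlisting>

   <para>
     If the two branches that increment <emphasis>line</emphasis> were removed, every chunk would be reported on line 1, which is the &quot;stuck on one line&quot; behaviour described above.
   </para>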

    <programlisting>
SEMI : ';' ;     
    </programlisting>

    <para>This simply matches a semicolon (';').</para>

    <para>
      The Lexer and Parser are set up using a main method. This could have been included as a block of code within the parser block (or lexer block) in the form of a <emphasis>static main</emphasis> method, so that there would be no need to write an extra class. For the sake of clarity, an extra class will be used here:
    </para>

    <programlisting>
import java.io.*;
public class Main {
  public static void main(String args[]) {
    DataInputStream input = new DataInputStream(System.in);
    SimpleLexer lexer = new SimpleLexer(input);
    SimpleParser parser = new SimpleParser(lexer);
    try {
      parser.entry();
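    } catch(Exception e) {
      // Report failures instead of silently discarding them.
      System.err.println(&quot;Exception: &quot; + e);
    }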
  }
}
    </programlisting>

    <para>
      This is straightforward: an input stream is instantiated and passed to the constructor of the lexer, and the lexer is then passed to the constructor of the parser.  Once the parser has been created, the <emphasis>entry</emphasis> method defined in our parser, which does all the work, is invoked.  This is oriented toward receiving input from a file via redirection, but one can just as happily type the input at the prompt if the main method is invoked without redirecting any input to it.  Here is the input file, <filename>test.txt</filename>:
    </para>

    <programlisting>
06/06/82 Peter 20;
03/04/83 Rosie 19;
04/05/81 Mikey 21;
    </programlisting>

    <para>After creating the classes with:</para>

    <screen><userinput><command>java</command> antlr.Tool <filename>Simple2.g</filename></userinput></screen>

    <para>And compiling everything:</para>

    <screen><userinput><command>javac</command> *.java</userinput></screen>
        
    <para>Main is invoked:</para>

    <screen><userinput><command>java</command> Main &lt; <filename>test.txt</filename></userinput></screen>

    <para>This produces the output:</para>

    <screen>
Name: Peter, Age: 20, DOB: 06/06/82
Name: Rosie, Age: 19, DOB: 03/04/83
Name: Mikey, Age: 21, DOB: 04/05/81
    </screen>

    <para>An error can be simulated by changing Mikey's age to &quot;a21&quot;, which produces the output:</para>

    <screen>
Name: Peter, Age: 20, DOB: 06/06/82
Name: Rosie, Age: 19, DOB: 03/04/83
line 3: expecting AGE, found 'a'
    </screen>
  </sect1>

  <sect1 id="ANTLR-ExpressionEval"><title>Expression Evaluation Example</title>
    <para>
      The time has come for the obligatory expression evaluator example.  The expression evaluator will start off simple; more advanced features will then be added. The initial ANTLR grammar for the expression evaluator, <ulink url="files/expression.g">expression.g</ulink>, is shown below:
    </para>

    <programlisting>
class ExpressionParser extends Parser;
options { buildAST=true; }

expr     : sumExpr SEMI!;
sumExpr  : prodExpr ((PLUS^|MINUS^) prodExpr)* ; 
prodExpr : powExpr ((MUL^|DIV^|MOD^) powExpr)* ;
powExpr  : atom (POW^ atom)? ;
atom     : INT ;

class ExpressionLexer extends Lexer;

PLUS  : '+' ;
MINUS : '-' ;
MUL   : '*' ;
DIV   : '/' ;
MOD   : '%' ;
POW   : '^' ;
SEMI  : ';' ;
protected DIGIT : '0'..'9' ;
INT   : (DIGIT)+ ;

{import java.lang.Math;}
class ExpressionTreeWalker extends TreeParser;

expr returns [double r]
  { double a,b; r=0; }

  : #(PLUS  a=expr b=expr)  { r=a+b; }
  | #(MINUS a=expr b=expr)  { r=a-b; }
  | #(MUL   a=expr b=expr)  { r=a*b; }
  | #(DIV   a=expr b=expr)  { r=a/b; }
  | #(MOD   a=expr b=expr)  { r=a%b; }
  | #(POW   a=expr b=expr)  { r=Math.pow(a,b); }
  | i:INT { r=(double)Integer.parseInt(i.getText()); }
  ;
    </programlisting>

    <para>
      The grammar definition begins with the parser section, which will be explained step by step.
    </para>

    <programlisting>
class ExpressionParser extends Parser;
options { buildAST=true; }      
    </programlisting>

    <para>
      The class is declared normally and then the option <emphasis>buildAST=true</emphasis> is specified. This signifies that the parser should build an AST (Abstract Syntax Tree) as it parses the input tokens. Special notation will be used to specify how the tree should be built up.
    </para>

    <programlisting>
expr     : sumExpr SEMI!;
sumExpr  : prodExpr ((PLUS^|MINUS^) prodExpr)* ;
    </programlisting>

    <para>
      The top level rule of the expression is <emphasis>expr</emphasis>. It simply references the next element down, indicating that an <emphasis>expr</emphasis> can be a <emphasis>sumExpr</emphasis>, and it also specifies that an expression is terminated with a <emphasis role="strong">SEMI</emphasis>.  This rule is redundant in that the top rule could have been <emphasis>sumExpr</emphasis>, since <emphasis>expr</emphasis> references it directly; the only addition is <emphasis role="strong">SEMI!</emphasis>, which would then have to be appended to the <emphasis>sumExpr</emphasis> rule.  The reason for keeping <emphasis>expr</emphasis> is that it makes it clearer that the overall thing being matched is an expression and not a sum expression. The terminating <emphasis role="strong">SEMI</emphasis> is postfixed with a '!', which tells the AST builder not to include this token in the tree (otherwise it would be attached to the end of the whole expression).
    </para>

    <para>
      The <emphasis>sumExpr</emphasis> is an expression that consists of a <emphasis>prodExpr</emphasis> followed by zero or more ((<emphasis role="strong">PLUS</emphasis> OR <emphasis role="strong">MINUS</emphasis>) <emphasis>prodExpr</emphasis>) sequences. This is so that <emphasis>sumExpr</emphasis> can recognise sequences such as:
    </para>

    <programlisting>
      <emphasis>prodExpr</emphasis> <emphasis role="strong">PLUS</emphasis> <emphasis>prodExpr</emphasis> <emphasis role="strong">MINUS</emphasis> <emphasis>prodExpr</emphasis>
    </programlisting>

    <para>
      Because zero or more of these sequences may be present for a match, expressions without <emphasis role="strong">PLUS</emphasis> and <emphasis role="strong">MINUS</emphasis> can be constructed: an expression can just be a <emphasis>prodExpr</emphasis> without the additions.  This forms a hierarchy in which, thanks to this kind of optionality (zero or more), an expression can be just an <emphasis>atom</emphasis>, which is an <emphasis role="strong">INT</emphasis>. This will be explained in more detail later.
    </para>

    <para>
      An AST is a special kind of tree that can have an arbitrary number of subtrees (children) which are ASTs themselves. When walking the tree one can manipulate the order in which nodes are visited with all the expressiveness of the implementation language (Java, C++ or Sather).
    </para>

    <para>
      The <emphasis role="strong">PLUS</emphasis> and <emphasis role="strong">MINUS</emphasis> token references are postfixed with the caret ('^') character. This is an ANTLR directive specific to the creation of ASTs: it specifies that the token the caret postfixes should become the root of the current AST or AST subtree. Each caret-postfixed token matched within a rule becomes the new root of the tree that the rule has built so far, with that earlier tree as its first child; for <emphasis role="strong">1+2-3</emphasis> the final root is therefore <emphasis role="strong">MINUS</emphasis>, with the <emphasis role="strong">PLUS</emphasis> subtree as its first child. If a caret-postfixed token is matched inside a rule that another rule references, it becomes the root of a subtree, and that subtree becomes a child in the tree of the referencing rule.
    </para>

    <para>
      The AST structure is defined in such a way that operator precedence is obeyed.   In this case the tree will be constructed such that a root represents an operator and its children represent the operands. When the tree is walked, it is processed in an ordinary fashion, evaluating the children from left to right recursively. The rule:
    </para>

    <programlisting>
sumExpr  : prodExpr ((PLUS^|MINUS^) prodExpr)* ;
    </programlisting>
      
    <para>
      Dictates that the first and second children of a <emphasis>sumExpr</emphasis> must be <emphasis role="strong">prodExpr</emphasis>'s. It is because of this that the desired precedence is guaranteed. Take a look at the <emphasis>prodExpr</emphasis> rule:
    </para>

    <programlisting>
prodExpr : powExpr ((MUL^|DIV^|MOD^) powExpr)* ;      
    </programlisting>

    <para>
      This says that a <emphasis>prodExpr</emphasis> must consist of a <emphasis>powExpr</emphasis> followed by zero or more (<emphasis role="strong">MUL</emphasis> OR <emphasis role="strong">DIV</emphasis> OR <emphasis role="strong">MOD</emphasis> AND <emphasis>powExpr</emphasis>) sequences.  Multiplication, division and modulo have been grouped together because they have equal precedence.  The caret ('^') is used again to specify that the root of this subtree will be the operator that is specified in the expression, hence the first and second children can only be <emphasis>powExpr</emphasis>'s.
    </para>

    <para> 
      The children of a root are evaluated before the root's operator is applied to them, so the value of each <emphasis>powExpr</emphasis> is calculated before the operator of the enclosing <emphasis>prodExpr</emphasis> is applied. This is the desired course of action, since a power expression has a higher precedence than a product expression, which in turn has a higher precedence than a sum expression. The evaluation order of the expression is therefore determined by the structure of the AST, and that structure enforces the desired precedences.
    </para>
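
    <para>
      The same layering can be seen in a hand-written recursive-descent evaluator. The sketch below is an assumption-laden illustration, not what ANTLR generates: the class <emphasis>PrecedenceSketch</emphasis> and its helper methods are invented here, it reads characters directly instead of tokens, and it computes values as it goes rather than building an AST. There is one method per rule, and each method only ever calls the method for the next higher precedence level:
    </para>

    <programlisting><![CDATA[
public class PrecedenceSketch {
    private final String src;
    private int pos = 0;

    PrecedenceSketch(String src) { this.src = src; }

    /** expr : sumExpr SEMI */
    static double evaluate(String text) {
        PrecedenceSketch p = new PrecedenceSketch(text);
        double value = p.sumExpr();
        p.expect(';');
        return value;
    }

    /** sumExpr : prodExpr ((PLUS|MINUS) prodExpr)* */
    double sumExpr() {
        double left = prodExpr();
        while (peek() == '+' || peek() == '-') {
            char op = next();
            double right = prodExpr();
            left = (op == '+') ? left + right : left - right;
        }
        return left;
    }

    /** prodExpr : powExpr ((MUL|DIV|MOD) powExpr)* */
    double prodExpr() {
        double left = powExpr();
        while (peek() == '*' || peek() == '/' || peek() == '%') {
            char op = next();
            double right = powExpr();
            if (op == '*') left = left * right;
            else if (op == '/') left = left / right;
            else left = left % right;
        }
        return left;
    }

    /** powExpr : atom (POW atom)? */
    double powExpr() {
        double base = atom();
        if (peek() == '^') {
            next();
            return Math.pow(base, atom());
        }
        return base;
    }

    /** atom : INT */
    double atom() {
        int start = pos;
        while (pos < src.length() && src.charAt(pos) >= '0' && src.charAt(pos) <= '9') pos++;
        if (start == pos) throw new IllegalStateException("expecting INT at offset " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }

    char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
    char next() { return src.charAt(pos++); }

    void expect(char c) {
        if (peek() != c) throw new IllegalStateException("expecting '" + c + "' at offset " + pos);
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("1+2*5;")); // 11.0
        System.out.println(evaluate("2*3^2;")); // 18.0
        System.out.println(evaluate("7-4-2;")); // 1.0
    }
}
]]></programlisting>

    <para>
      Because <emphasis>sumExpr</emphasis> has to finish a complete <emphasis>prodExpr</emphasis> before it can apply '+' or '-', the multiplication in <emphasis role="strong">1+2*5</emphasis> is always carried out first.
    </para>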

    <programlisting>
powExpr  : atom (POW^ atom)? ;
atom     : INT ;
    </programlisting>

    <para>
      It can be seen that a <emphasis>powExpr</emphasis> consists of an atom followed by an optional (<emphasis role="strong">POW</emphasis> <emphasis>atom</emphasis>) sequence.  This means that the sequence must occur zero or one times. Why is this rule not defined as zero or more times like this:
    </para>

    <programlisting>
powExpr  : atom (POW^ atom)* ;
    </programlisting>
      
    <para>
      Because that version of <emphasis>powExpr</emphasis> is broken. What happens if more than one power is specified? Evaluation will proceed from left to right, but amongst the <emphasis role="strong">POW</emphasis>s it should proceed from right to left: an expression such as <emphasis role="strong">3^2^2</emphasis> should evaluate as <emphasis role="strong">(3^(2^2))</emphasis>, but it would be evaluated as <emphasis role="strong">((3^2)^2)</emphasis>. A fix for this will be examined later. By allowing at most one (<emphasis role="strong">POW</emphasis> <emphasis>atom</emphasis>) sequence, this problem is not encountered, but the user is limited to a single level of exponentiation. This should not be a serious problem, since the user can work out a nested exponent by hand and enter it as a single integer (<emphasis role="strong">3^4</emphasis> rather than <emphasis role="strong">3^2^2</emphasis>).
    </para>
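
    <para>
      The difference between the two evaluation orders is easy to check numerically. In the following sketch the class <emphasis>PowAssociativitySketch</emphasis> and its two methods are hypothetical names used only for this illustration. Note that <emphasis role="strong">3^2^2</emphasis> happens to give 81 in both orders, so the operands 2, 3 and 2 are used instead:
    </para>

    <programlisting><![CDATA[
public class PowAssociativitySketch {

    /** Groups from the left: ((a^b)^c), which is what a (POW atom)* loop would build. */
    static double leftToRight(double... operands) {
        double result = operands[0];
        for (int i = 1; i < operands.length; i++) {
            result = Math.pow(result, operands[i]);
        }
        return result;
    }

    /** Groups from the right: (a^(b^c)), the conventional reading of a^b^c. */
    static double rightToLeft(double... operands) {
        double result = operands[operands.length - 1];
        for (int i = operands.length - 2; i >= 0; i--) {
            result = Math.pow(operands[i], result);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(leftToRight(2, 3, 2)); // 64.0  = (2^3)^2
        System.out.println(rightToLeft(2, 3, 2)); // 512.0 = 2^(3^2)
    }
}
]]></programlisting>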
      
    <para>
      The point is that subtrees are evaluated first. The AST is therefore built so that subtrees for operators of higher precedence can only be children of subtrees for operators of lower precedence, which means the operation of highest precedence is always evaluated first. The final rule is for an <emphasis>atom</emphasis>, which is an <emphasis role="strong">INT</emphasis>, so the simplest possible expression is a single integer.
    </para>

    <note>
      <para>
        You may find it easier to think of the AST defined here as a binary tree, because each of the operators has exactly two children; this is more intuitive and perhaps easier to understand.  Maintaining the analogy, the left and right subtrees of a root are evaluated before the root's operator is applied to them, that is, the tree is traversed in post-order: left, right, root.
      </para>
    </note>

    <para>Let's take another look at the parser in one block:</para>

    <programlisting>
class ExpressionParser extends Parser;
options { buildAST=true; }

expr     : sumExpr SEMI!;
sumExpr  : prodExpr ((PLUS^|MINUS^) prodExpr)* ; 
prodExpr : powExpr ((MUL^|DIV^|MOD^) powExpr)* ;
powExpr  : atom (POW^ atom)? ;
atom     : INT ;
    </programlisting>

    <para>
      How can an expression be just an atom? Assume we are trying to match the expression <emphasis role="strong">&quot;5;&quot;</emphasis>.  This consists of the tokens <emphasis role="strong">INT</emphasis> and <emphasis role="strong">SEMI</emphasis>. The parser will try to match the token <emphasis role="strong">INT</emphasis> with one of its rules. It looks at the first rule, <emphasis>expr</emphasis>, and sees the subrule <emphasis>sumExpr</emphasis> referenced, so it takes a look at that.  The first component of <emphasis>sumExpr</emphasis> is <emphasis>prodExpr</emphasis>, so the parser tries to match the token against <emphasis>prodExpr</emphasis>. The first component of <emphasis>prodExpr</emphasis> is <emphasis>powExpr</emphasis>, so the parser tries to match the token against <emphasis>powExpr</emphasis>.
    </para>
      
    <para>
      The first component of <emphasis>powExpr</emphasis> is an <emphasis>atom</emphasis>, so the parser tries to match the token against <emphasis>atom</emphasis>. <emphasis>atom</emphasis> is defined as consisting of the token <emphasis role="strong">INT</emphasis>, which is what the parser is looking for, so <emphasis>atom</emphasis> is matched. This means that <emphasis>powExpr</emphasis> is matched, which means that <emphasis>prodExpr</emphasis> is matched, which means that <emphasis>sumExpr</emphasis> is matched, which means that the first component of <emphasis>expr</emphasis> has been matched. The next token to be matched is <emphasis role="strong">SEMI</emphasis>, which matches the second component of <emphasis>expr</emphasis>, so <emphasis>expr</emphasis> is matched. Here is a diagram which attempts to illustrate this:
    </para>

    <figure><title><emphasis role="strong">INT</emphasis> <emphasis role="strong">SEMI</emphasis> Match</title>
      <mediaobject>
        <imageobject><imagedata fileref="files/images/int-semi-match.png" format="PNG"/></imageobject>
      </mediaobject>
    </figure>

    <para>
      The red lines (or dark lines if you are in greyscale) illustrate failed matches, and the line directions show the order in which the parser checks the rules.  The green lines illustrate correct matches.  The line from <emphasis role="strong">SEMI</emphasis> to <emphasis role="strong">SEMI</emphasis> is an exception: it is green because I wanted to illustrate that <emphasis role="strong">SEMI</emphasis> was a match. I suppose that the first line from <emphasis role="strong">INT</emphasis> is not really a fail either, but if another colour had been used the diagram would have looked odd.
    </para>

    <para>
      Do the rules handle precedence correctly? Let's take a look.  Take the expression <emphasis role="strong">1+2*5;</emphasis>. We know that this should evaluate as <emphasis role="strong">(1+(2*5))</emphasis> and not <emphasis role="strong">((1+2)*5)</emphasis>. This is the token sequence <emphasis role="strong">(INT)(PLUS)(INT)(MUL)(INT)(SEMI)</emphasis>.
    </para>

    <para>The parser can only generate the tree:</para>

    <figure><title>1+(2*5) Binary Tree</title>
      <mediaobject>
        <imageobject><imagedata fileref="files/images/oneplus2x5bintree.png" format="PNG"/></imageobject>
      </mediaobject>
    </figure>

    <para>
      Because a <emphasis>sumExpr</emphasis> cannot be a subtree of a <emphasis>prodExpr</emphasis>, as is defined in the parser grammar. It is because of the defined structure that precedence rules are maintained. Here is the AST version of the tree shown above:
    </para>

    <figure><title>1+(2*5) AST</title>
      <mediaobject>
        <imageobject><imagedata fileref="files/images/oneplus2x5ast.png" format="PNG"/></imageobject>
      </mediaobject>
    </figure>

    <para>
      An AST can have an arbitrary number of children; the children are referred to as &quot;siblings&quot; with respect to each other. One can develop trees such as:
    </para>

    <figure><title>Sum AST</title>
      <mediaobject>
        <imageobject><imagedata fileref="files/images/sumast.png" format="PNG"/></imageobject>
      </mediaobject>
    </figure>

    <para>
      Here <emphasis role="strong">sum</emphasis> is some function that accepts multiple arguments and returns their sum. You would of course specify what to do when encountering a particular node of the tree by using a tree parser of some sort; this example has a tree parser (called a tree walker because the tree is walked), which is discussed later. The next section of the grammar is the lexer:
    </para>

    <programlisting>
class ExpressionLexer extends Lexer;

PLUS  : '+' ;
MINUS : '-' ;
MUL   : '*' ;
DIV   : '/' ;
MOD   : '%' ;
POW   : '^' ;
SEMI  : ';' ;
protected DIGIT : '0'..'9' ;
INT   : (DIGIT)+ ;      
    </programlisting>

    <para>
      These rules are pretty self-explanatory and define the tokens used in the parser just described.  <emphasis role="strong">DIGIT</emphasis> is a protected rule, which means it can only be referenced by other lexer rules and is never delivered to the parser as a token in its own right. It is used in the definition of the <emphasis role="strong">INT</emphasis> rule, which says &quot;one or more digits&quot;.  If protected were not specified, ANTLR would report a lexical nondeterminism between <emphasis role="strong">DIGIT</emphasis> and <emphasis role="strong">INT</emphasis>, because both would begin with a digit.
    </para>

    <para>The next section of the grammar is the tree parser:</para>
      
    <programlisting>
{import java.lang.Math;}
class ExpressionTreeWalker extends TreeParser;

expr returns [double r]
  { double a,b; r=0; }

  : #(PLUS a=expr b=expr)  { r=a+b; }
  | #(MINUS a=expr b=expr) { r=a-b; }
  | #(MUL  a=expr b=expr)  { r=a*b; }
  | #(DIV  a=expr b=expr)  { r=a/b; }
  | #(MOD  a=expr b=expr)  { r=a%b; }
  | #(POW  a=expr b=expr)  { r=Math.pow(a,b); }
  | i:INT { r=(double)Integer.parseInt(i.getText()); }
  ;      
    </programlisting>

    <para>
      The line before the class declaration is a header that will be pulled into the generated Java file immediately before the class declaration.  Whitespace between the opening and closing braces is preserved in the generated file, so no unnecessary spaces have been left there. <emphasis>java.lang.Math</emphasis> is imported so that <emphasis>Math.pow</emphasis> can be used. Following the class declaration is the <emphasis>expr</emphasis> rule definition, which returns the <emphasis role="strong">double</emphasis> <emphasis>r</emphasis>:
    </para>

    <programlisting>
expr returns [double r]
  { double a,b; r=0; }
    </programlisting>

    <para>
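      Immediately before the first alternative of the rule is a Java code section that declares two doubles, <emphasis role="strong">a</emphasis> and <emphasis role="strong">b</emphasis>, and initialises <emphasis role="strong">r</emphasis> to zero.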
    </para>

    <programlisting>
  : #(PLUS  a=expr b=expr)  { r=a+b; }
  | #(MINUS a=expr b=expr)  { r=a-b; }
  | #(MUL   a=expr b=expr)  { r=a*b; }
  | #(DIV   a=expr b=expr)  { r=a/b; }
  | #(MOD   a=expr b=expr)  { r=a%b; }
  | #(POW   a=expr b=expr)  { r=Math.pow(a,b); }
  | i:INT { r=(double)Integer.parseInt(i.getText()); }
  ;      
    </programlisting>

    <para>The rule alternatives all use the AST syntax, that is:</para>

    <programlisting>
      #(ROOT child.1 child.2 ... child.n);
    </programlisting>

    <para>
      The first alternative says: if <emphasis role="strong">PLUS</emphasis> is found as a root, assign the values obtained by evaluating the two child subtrees to the variables <emphasis role="strong">a</emphasis> and <emphasis role="strong">b</emphasis> respectively. The rule does not literally say &quot;evaluate the subtrees&quot;; this happens automatically because the rule says to match a tree with <emphasis role="strong">PLUS</emphasis> as a root that has two <emphasis>expr</emphasis>'s as children. It is in the matching of these children, in order to match the whole rule, that the subtrees are evaluated. The action specified upon a successful match is to set <emphasis role="strong">r</emphasis> equal to <emphasis role="strong">a</emphasis>+<emphasis role="strong">b</emphasis>.
    </para>
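
    <para>
      The way matching the children forces their evaluation can be imitated with an ordinary recursive method. The sketch below does not use the ANTLR runtime: <emphasis>TreeWalkSketch</emphasis> and its <emphasis>Node</emphasis> class are hypothetical stand-ins for the generated tree walker and for an AST node with either no children or exactly two.
    </para>

    <programlisting><![CDATA[
public class TreeWalkSketch {

    /** Hypothetical stand-in for an AST node: a label plus zero or two children. */
    static final class Node {
        final String text;
        final Node left;
        final Node right;

        Node(String text) { this(text, null, null); }

        Node(String text, Node left, Node right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /** Plays the role of the expr rule: evaluate both children, then apply the root. */
    static double expr(Node n) {
        if (n.left == null) {
            return Integer.parseInt(n.text);      // the INT alternative
        }
        double a = expr(n.left);                  // a=expr
        double b = expr(n.right);                 // b=expr
        switch (n.text) {
            case "+": return a + b;
            case "-": return a - b;
            case "*": return a * b;
            case "/": return a / b;
            case "%": return a % b;
            case "^": return Math.pow(a, b);
            default:  throw new IllegalArgumentException("unexpected root: " + n.text);
        }
    }

    public static void main(String[] args) {
        // Hand-built tree for 1+2*5, i.e. #(PLUS 1 #(MUL 2 5))
        Node tree = new Node("+", new Node("1"),
                new Node("*", new Node("2"), new Node("5")));
        System.out.println(expr(tree)); // 11.0
    }
}
]]></programlisting>

    <para>
      The operator at a root is only applied once both recursive calls have returned, which is the left, right, root order of a post-order traversal.
    </para>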
      
    <para>
      In this case it is obvious, with a lookahead of one, which alternative to match if <emphasis role="strong">PLUS</emphasis> is found as a root, because only one alternative has <emphasis role="strong">PLUS</emphasis> as its first element.  This may not always be the case. Consider adding the ability to specify the sign of a number (as many times as you like, -+-+-5): the <emphasis role="strong">PLUS</emphasis> and <emphasis role="strong">MINUS</emphasis> tokens would then not be used exclusively for the dyadic addition and subtraction alternatives but would also occur as the first element of the monadic sign alternative.  With a lookahead of one there would be conflicts; this issue is discussed later.
    </para>

    <programlisting>
  | i:INT { r=(double)Integer.parseInt(i.getText()); }
    </programlisting>

    <para>
      This alternative differs from the others, so it gets its own mention.  It simply assigns to <emphasis role="strong">r</emphasis> the value of the <emphasis role="strong">INT</emphasis> found.  This handles the &quot;base case&quot; of the recursion.
    </para>

    <para>
      The tree parser is used to walk the tree and evaluate the expressions entered. Let's look at how this all fits together. Here is <ulink url="files/Main.java">Main.java</ulink>:
    </para>

    <programlisting>
import java.io.*;
import antlr.CommonAST;
import antlr.collections.AST;
import antlr.debug.misc.ASTFrame;
public class Main {
  public static void main(String args[]) {
    try {
      DataInputStream input = new DataInputStream(System.in);

      ExpressionLexer lexer = new ExpressionLexer(input); 

      ExpressionParser parser = new ExpressionParser(lexer);
      parser.expr();

      CommonAST parseTree = (CommonAST)parser.getAST();
      System.out.println(parseTree.toStringList());
      ASTFrame frame = new ASTFrame(&quot;The tree&quot;, parseTree);
      frame.setVisible(true);

      ExpressionTreeWalker walker = new ExpressionTreeWalker();
      double r = walker.expr(parseTree);
      System.out.println(&quot;Value: &quot;+r);
    } catch(Exception e) { System.err.println(&quot;Exception: &quot;+e); }
  }
}
    </programlisting>

    <para>
      First we import all the necessary classes and open the class and <emphasis>static main</emphasis> method as usual. The contents of <emphasis>main</emphasis> are wrapped within a <emphasis>try</emphasis>...<emphasis>catch</emphasis> statement to catch any exceptions thrown; if one is caught, it is printed to stderr.
    </para>

    <programlisting>
      DataInputStream input = new DataInputStream(System.in);
    </programlisting>

    <para>A <emphasis>DataInputStream</emphasis> is set up.</para>

    <programlisting>
      ExpressionLexer lexer = new ExpressionLexer(input);
    </programlisting>

    <para>The lexer is created and told to accept data from the input stream.</para>

    <programlisting>
      ExpressionParser parser = new ExpressionParser(lexer);
      parser.expr();      
    </programlisting>

    <para>
      The parser is created, using the lexer to deliver tokens. The <emphasis>expr()</emphasis> method is called which tells the parser to match an expression as defined by the parser rules.
    </para>

    <programlisting>
CommonAST parseTree = (CommonAST)parser.getAST();      
    </programlisting>

    <para>
      Remember that <emphasis role="strong">options { buildAST=true; }</emphasis> was specified? Here a <emphasis role="strong">CommonAST</emphasis> variable is declared and assigned a reference to the AST created by the parser's <emphasis>expr</emphasis> rule. The reference has to be downcast from <emphasis role="strong">antlr.collections.AST</emphasis>, the type returned by <emphasis>getAST()</emphasis>.
    </para>

    <programlisting>
System.out.println(parseTree.toStringList());
    </programlisting>

    <para>
      The AST is printed using the <emphasis role="strong">CommonAST</emphasis> <emphasis>toStringList()</emphasis> method.
    </para>
    
    <programlisting>
      ASTFrame frame = new ASTFrame(&quot;The tree&quot;, parseTree);
      frame.setVisible(true);
    </programlisting>

    <para>
      A new ASTFrame is created. <emphasis>ASTFrame</emphasis>, imported from the <emphasis>antlr.debug.misc</emphasis> package, is a frame designed for viewing ASTs. The frame is created with the title &quot;The tree&quot; and the <emphasis role="strong">CommonAST</emphasis> object obtained earlier. The frame is then made visible, which displays the AST in a window.
    </para>

    <programlisting>
      ExpressionTreeWalker walker = new ExpressionTreeWalker();
      double r = walker.expr(parseTree);
      System.out.println(&quot;Value: &quot;+r);      
    </programlisting>

    <para>
      An <emphasis role="strong">ExpressionTreeWalker</emphasis> is created. A new <emphasis role="strong">double</emphasis> is then defined and assigned the value of the expression, obtained by calling the <emphasis>expr</emphasis> rule that was defined for the TreeParser (each rule is translated to a Java method) with the AST as its argument. Finally the value of the expression is printed to stdout.
    </para>

    <para>
      Let's go through an example expression to see what output is generated. Assume that all classes have been generated by running ANTLR on the grammar. The expression:
    </para>

    <programlisting>
1+2-3*4/5^6;
    </programlisting>

    <para>Is placed in a file called <filename>test.txt</filename> and used as input to Main:</para>

    <screen><userinput><command>java</command> Main &lt; <filename>test.txt</filename></userinput></screen>

    <para>The tree expressed as a list via <emphasis>toStringList()</emphasis> is output:</para>

    <screen>
 ( - ( + 1 2 ) ( / ( * 3 4 ) ( ^ 5 6 ) ) ) ;
    </screen>
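
    <para>
      The list notation is read as &quot;( root child child ... )&quot;, with leaves written as bare text. The following sketch reproduces the parenthesised part of the output for a hand-built tree. It is an approximation written for this explanation, not the implementation of <emphasis>toStringList()</emphasis>: <emphasis>ListFormSketch</emphasis>, <emphasis>Node</emphasis> and <emphasis>toList</emphasis> are invented names.
    </para>

    <programlisting><![CDATA[
public class ListFormSketch {

    /** Hypothetical tree node with any number of children. */
    static final class Node {
        final String text;
        final Node[] children;

        Node(String text, Node... children) {
            this.text = text;
            this.children = children;
        }
    }

    /** Writes a leaf as its text and any other node as "( root child child ... )". */
    static String toList(Node n) {
        if (n.children.length == 0) {
            return n.text;
        }
        StringBuilder sb = new StringBuilder("( ").append(n.text);
        for (Node child : n.children) {
            sb.append(' ').append(toList(child));
        }
        return sb.append(" )").toString();
    }

    public static void main(String[] args) {
        // Hand-built tree for 1+2-3*4/5^6
        Node tree = new Node("-",
                new Node("+", new Node("1"), new Node("2")),
                new Node("/",
                        new Node("*", new Node("3"), new Node("4")),
                        new Node("^", new Node("5"), new Node("6"))));
        System.out.println(toList(tree));
        // ( - ( + 1 2 ) ( / ( * 3 4 ) ( ^ 5 6 ) ) )
    }
}
]]></programlisting>

    <para>
      Reading the list from the inside out gives the evaluation order: the '+', '*' and '^' subtrees are evaluated before the '/' that joins the last two, and the '-' at the root is applied last.
    </para>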

    <para>The AST comes up in the ASTFrame:</para>

    <figure><title>AST of (1+2-3*4/5^6)</title>
      <mediaobject>
        <imageobject><imagedata fileref="files/images/ast-one.png" format="PNG"/></imageobject>
      </mediaobject>
    </figure>

    <para>And the value is output:</para>

    <screen> 
Value: 2.999232
    </screen>

    <para>
      That concludes the basic expression evaluator. The next section will discuss some extensions.
    </para>
  </sect1>

  <sect1 id="ANTLR-ExpressionEval-Extensions"><title>Extending The Expression Evaluator</title>
    <sect2 id="ANTLR-ExpressionEval-Extensions-NestedExpressions"><title>Nested Expressions</title>
      <para>
        The old expression evaluator did not allow nesting of brackets; in fact brackets were not mentioned at all. Try to think how you would add nested brackets to the evaluator.  How would you get the evaluator to create an AST such that the innermost nested brackets are evaluated first? The solution is quite simple, if a little subtle. First of all let's take a look at the content of the old parser:
      </para>

      <programlisting>
expr     : sumExpr SEMI!;
sumExpr  : prodExpr ((PLUS^|MINUS^) prodExpr)*;
prodExpr : powExpr ((MUL^|DIV^|MOD^) powExpr)* ;
powExpr  : atom (POW^ atom)? ;
atom     : INT ;
      </programlisting>

      <para>
        <emphasis>powExpr</emphasis> has a higher precedence than <emphasis>prodExpr</emphasis> and is made to be a child of a <emphasis>prodExpr</emphasis> because of the tree structure. During a traversal of the tree, the children are evaluated independently of the root, because the values of the children are assigned to the variables <emphasis role="strong">a</emphasis> and <emphasis role="strong">b</emphasis> before the operation is applied and the result returned:
      </para>

      <programlisting>
#(PLUS  a=expr b=expr)  { r=a+b; }        
      </programlisting>

      <para>
        Both <emphasis role="strong">a</emphasis> and <emphasis role="strong">b</emphasis> have to be evaluated first because the action is only executed after a successful match of the rule, which means that both <emphasis role="strong">a</emphasis> and <emphasis role="strong">b</emphasis> must already have been evaluated. The most basic component of this AST is an <emphasis>atom</emphasis>.  We want to say that an atom can also be another expression, but one cannot simply redefine atom as:
      </para>

      <programlisting>
atom     : INT | expr ;
      </programlisting>

      <para>This would generate some infinite recursion errors:</para>

      <programlisting>
expression.g:8: infinite recursion to rule sumExpr from rule atom
expression.g:7: infinite recursion to rule sumExpr from rule powExpr
expression.g:6: infinite recursion to rule sumExpr from rule prodExpr
expression.g:5: infinite recursion to rule sumExpr from rule sumExpr
expression.g:8: infinite recursion to rule sumExpr from rule atom
      </programlisting>

      <para>
        Imagine the case of an out-of-place token. The parser would check the token against each rule in turn and fail to match. On reaching <emphasis>atom</emphasis> it would find that <emphasis role="strong">INT</emphasis> did not match, so it would then try <emphasis>expr</emphasis>, which would start another check through all the rules without consuming any input, and so on, forever. So <emphasis>expr</emphasis> has to be redefined as:
      </para>
        
      <programlisting>
expr     : LPAREN^ sumExpr RPAREN! ;
      </programlisting>

      <para>
        This redefinition is crucial. <emphasis role="strong">LPAREN</emphasis> and <emphasis role="strong">RPAREN</emphasis> are tokens defined in the lexer and represent <emphasis role="strong">'('</emphasis> and <emphasis role="strong">')'</emphasis> respectively.  This rule says to match <emphasis role="strong">LPAREN</emphasis> followed by a <emphasis>sumExpr</emphasis> followed by <emphasis role="strong">RPAREN</emphasis>. <emphasis role="strong">LPAREN</emphasis> is suffixed with a caret ('^'), which indicates that in the generated AST <emphasis role="strong">LPAREN</emphasis> should become a tree root with <emphasis>sumExpr</emphasis> as its child. <emphasis role="strong">RPAREN</emphasis> is suffixed with an exclamation mark ('!'), so it is left out of the AST, where it is not needed. <emphasis role="strong">SEMI</emphasis> was also removed so that nested expressions do not have to be terminated by a <emphasis role="strong">SEMI</emphasis>.
      </para>

      <para>
        If the out-of-place token arose again, the parser could not get into an infinite loop because the <emphasis>expr</emphasis> rule begins with an <emphasis role="strong">LPAREN</emphasis>; if the token did not match <emphasis role="strong">LPAREN</emphasis>, the parser would immediately throw a NoViableAltException. There is no way the rogue token could get past the <emphasis>expr</emphasis> rule. Here is the new parser definition in one block:
      </para>

      <programlisting>
expr     : LPAREN^ sumExpr RPAREN! ;
sumExpr  : prodExpr ((PLUS^|MINUS^) prodExpr)* ;
prodExpr : powExpr ((MUL^|DIV^|MOD^) powExpr)* ;
powExpr  : atom (POW^ atom)? ;
atom     : INT | expr ;
      </programlisting>

      <para>
        So now an <emphasis>expr</emphasis> can recursively contain as many nested <emphasis>expr</emphasis>s as desired. What should the tree parser do when it encounters a tree or subtree with <emphasis role="strong">LPAREN</emphasis> as root and an <emphasis>expr</emphasis> as a child? The desired action is to return the value of the evaluation of the child. Adding this rule to the tree parser achieves this:
      </para>

      <programlisting>
  | #(LPAREN a=expr) { r=a;}
      </programlisting>

      <para>
        This rule matches an AST with an <emphasis role="strong">LPAREN</emphasis> as root and an <emphasis>expr</emphasis> as a child.  The result of the evaluation of <emphasis>expr</emphasis> is assigned to the <emphasis role="strong">a</emphasis> variable, which in turn is assigned to the <emphasis role="strong">r</emphasis> variable. Hence <emphasis role="strong">r</emphasis> will not receive a value until <emphasis role="strong">a</emphasis> does, and <emphasis role="strong">a</emphasis> will not receive a value until <emphasis>expr</emphasis> has been matched. This causes a knock-on evaluation of all subtrees of <emphasis>expr</emphasis> in order to match the original <emphasis>expr</emphasis>, which may include the evaluation of other <emphasis>expr</emphasis>s.
      </para>
        
      <para>
        Eventually all leaves of the AST must be <emphasis role="strong">INT</emphasis>s, since any expression that can actually be typed in contains only a finite number of nested subexpressions. It is assumed that the person creating the expression is not attempting to build an AST so deeply nested that evaluating it keeps calling <emphasis>expr</emphasis> billions of recursive levels down in order to match the master <emphasis>expr</emphasis>.
      </para>
        
      <para>
        In the normal case, the calls to <emphasis>expr</emphasis> will return a final value as a result of all these <emphasis role="strong">INT</emphasis> leaves being used in the various operations specified in the expression. The values of these operations will work their way up the recursive levels until eventually the original <emphasis>expr</emphasis> has been matched, whereupon <emphasis role="strong">a</emphasis> will be assigned the value of the <emphasis>expr</emphasis>, <emphasis role="strong">r</emphasis> will be assigned the value of <emphasis role="strong">a</emphasis> and <emphasis role="strong">r</emphasis> will be returned.
      </para>
      
      <para>
        To restate: whenever an <emphasis role="strong">LPAREN</emphasis> is encountered, the sub-expression is evaluated, and the value it returns is then used in the evaluation of the <emphasis>expr</emphasis> that had the sub-expression as a child. The result of that evaluation may, in turn, be used in the evaluation of the parent <emphasis>expr</emphasis>, until no more parent <emphasis>expr</emphasis>s are present and the master <emphasis>expr</emphasis> has been evaluated.
      </para>
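
      <para>
        The evaluation order described above can be imitated in plain Java. The following is only a minimal sketch, not the code that ANTLR generates: the <emphasis>Node</emphasis> class and the helper methods <emphasis>leaf</emphasis>, <emphasis>op</emphasis> and <emphasis>eval</emphasis> are hypothetical stand-ins for the real AST and tree parser. It shows that the children of a node are always evaluated before the operation at the root is applied, and that an <emphasis role="strong">LPAREN</emphasis> root simply passes the value of its child upwards.
      </para>

      <programlisting><![CDATA[
```java
import java.util.Arrays;
import java.util.List;

public class TreeEvalSketch {
    // Hypothetical node type standing in for ANTLR's AST; not part of the ANTLR API.
    static final class Node {
        final String type;          // "INT", "LPAREN", "PLUS", "MUL", ...
        final String text;          // only meaningful for INT leaves
        final List<Node> children;

        Node(String type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    static Node leaf(int value) { return new Node("INT", Integer.toString(value)); }

    static Node op(String type, Node... children) { return new Node(type, null, children); }

    // Mirrors the shape of the tree parser's expr rule:
    // the children are evaluated before the root's action runs.
    static double eval(Node n) {
        if (n.type.equals("INT")) return Integer.parseInt(n.text);
        double a = eval(n.children.get(0));
        if (n.type.equals("LPAREN")) return a;      // like #(LPAREN a=expr) { r=a; }
        double b = eval(n.children.get(1));
        switch (n.type) {
            case "PLUS":  return a + b;
            case "MINUS": return a - b;
            case "MUL":   return a * b;
            case "DIV":   return a / b;
            case "MOD":   return a % b;
            case "POW":   return Math.pow(a, b);
            default: throw new IllegalArgumentException("unknown node type: " + n.type);
        }
    }

    public static void main(String[] args) {
        // The tree for (5*(2+3)): LPAREN -> MUL -> (5, LPAREN -> PLUS -> (2, 3))
        Node tree = op("LPAREN",
                op("MUL", leaf(5), op("LPAREN", op("PLUS", leaf(2), leaf(3)))));
        System.out.println(eval(tree));   // prints 25.0
    }
}
```
]]></programlisting>

      <para>
        The inner <emphasis role="strong">LPAREN</emphasis> subtree has to return its value before the multiplication can be applied, which is the behaviour the tree parser rule relies on.
      </para>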

     <para>
        An AST should help to clarify the evaluation order; here is the AST of the expression <emphasis role="strong">(5*(2+3))</emphasis>:
      </para>

      <figure><title>AST of (5*(2+3))</title>
        <mediaobject>
          <imageobject><imagedata fileref="files/images/times5plus23ast.png" format="PNG"/></imageobject>
        </mediaobject>
      </figure>

      <para>
        The root of the tree is an <emphasis>expr</emphasis>, signified by the open bracket '<emphasis role="strong">LPAREN</emphasis>' as expected. Its child is a <emphasis>prodExpr</emphasis> whose first child is an <emphasis>atom</emphasis>, an <emphasis role="strong">INT</emphasis> equal to five, and whose second child is another <emphasis>atom</emphasis>, this time an <emphasis>expr</emphasis>. The child of that <emphasis>expr</emphasis> is a <emphasis>sumExpr</emphasis> with two children, both of which are <emphasis>atom</emphasis>s: <emphasis role="strong">INT</emphasis>s equal to two and three respectively. Here is another illustration of this AST to help clarify things:
      </para>

      <figure><title>Another View of (5*(2+3))</title>
        <mediaobject>
          <imageobject><imagedata fileref="files/images/times5plus23astext.png" format="PNG"/></imageobject>
        </mediaobject>
      </figure>

      <para>
        Here is the whole grammar: <ulink url="files/expression2.g">expression2.g</ulink>. The <emphasis>Main</emphasis> method used to run it is the same as for the original expression evaluator.
      </para>

      <note>
        <para>
          Adding subexpressions also solves the problem of being limited to a single level of exponentiation: the ambiguous expression <emphasis role="strong">(2^5^5)</emphasis> can be respecified as <emphasis role="strong">(2^(5^5))</emphasis> to get the desired result.
        </para>
      </note>
    </sect2>

    <sect2 id="ANTLR-ExpressionEval-Extensions-"><title>Adding The Sign Operator</title>
      <para>
        The sign operator has a higher precedence than any of the operators already defined in this expression evaluator; hence, its rule will occur just above <emphasis>atom</emphasis>. There is no need to use syntactic predicates to distinguish between dyadic and monadic use of the <emphasis role="strong">PLUS</emphasis> and <emphasis role="strong">MINUS</emphasis> operators, because this is implied by the context in which the operator is used.  When matching the dyadic use of the <emphasis role="strong">PLUS</emphasis> and <emphasis role="strong">MINUS</emphasis> operators, the parser expects to see the operator infixed between two <emphasis>prodExpr</emphasis>s, whereas in the monadic sense the parser expects to see the operator prefixing an <emphasis>atom</emphasis>. Here is how it is done:
      </para>

      <programlisting>
imaginaryTokenDefinitions :
   SIGN_MINUS
   SIGN_PLUS
;

expr     : LPAREN^ sumExpr RPAREN! ;
sumExpr  : prodExpr ((PLUS^|MINUS^) prodExpr)* ;
prodExpr : powExpr ((MUL^|DIV^|MOD^) powExpr)* ;
powExpr  : signExpr (POW^ signExpr)? ;
signExpr : (
         m:MINUS^ {#m.setType(SIGN_MINUS);}
         | p:PLUS^  {#p.setType(SIGN_PLUS);}
         )? atom ;
atom     : INT | expr ;
      </programlisting>

      <para>Ignore the imaginary token business for now; it is explained later.</para>

      <para>
        Consider the matching of the expression <emphasis role="strong">(-3--2)</emphasis>. The first token is <emphasis role="strong">LPAREN</emphasis>, which is matched by <emphasis>expr</emphasis>. The second token is <emphasis role="strong">MINUS</emphasis>, so the parser checks whether this token can begin a <emphasis>sumExpr</emphasis>, the rule it must match next to complete the <emphasis>expr</emphasis> that has just been opened. Is the first element of <emphasis>sumExpr</emphasis> a <emphasis role="strong">MINUS</emphasis>? No, it is a reference to another rule, so <emphasis>prodExpr</emphasis> is called to find out whether the first token of that rule is a <emphasis role="strong">MINUS</emphasis>. Is the first element of <emphasis>prodExpr</emphasis> a <emphasis role="strong">MINUS</emphasis>? No, it is another rule reference, so <emphasis>powExpr</emphasis> is called to find out whether the first token of that rule is a <emphasis role="strong">MINUS</emphasis>. Is the first element of <emphasis>powExpr</emphasis> a <emphasis role="strong">MINUS</emphasis>? No, it is another rule reference, so <emphasis>signExpr</emphasis> is called to find out whether the first token of that rule is a <emphasis role="strong">MINUS</emphasis>.
      </para>
        
      <para>
        Is the first token of <emphasis>signExpr</emphasis> a <emphasis role="strong">MINUS</emphasis>? Yes, it is! However, the <emphasis role="strong">MINUS</emphasis> must be followed by an <emphasis>atom</emphasis> for <emphasis>signExpr</emphasis> to match. The next token is an <emphasis role="strong">INT</emphasis> (3), which <emphasis>is</emphasis> an <emphasis>atom</emphasis>, so finally something has been matched: a <emphasis>signExpr</emphasis>. Let's not get carried away, though: this whole process began because the parser was trying to match the start of <emphasis>sumExpr</emphasis>, which its parent <emphasis>expr</emphasis> must contain. This caused calls down through the rules, ending up at <emphasis>signExpr</emphasis>. Now that a match has occurred, the parser works its way back up the recursive levels and discovers that it has matched a <emphasis>powExpr</emphasis>, because a <emphasis>powExpr</emphasis> may consist of just a <emphasis>signExpr</emphasis>; this in turn means it has matched a <emphasis>prodExpr</emphasis>, because a <emphasis>prodExpr</emphasis> can consist of just a <emphasis>powExpr</emphasis>; and then that it has also matched a <emphasis>sumExpr</emphasis>, because a <emphasis>sumExpr</emphasis> can consist of just a <emphasis>prodExpr</emphasis>. Effectively it has almost matched an <emphasis>expr</emphasis>, because if the next token were an <emphasis role="strong">RPAREN</emphasis> the <emphasis>expr</emphasis> would be finished.  However, the <emphasis>sumExpr</emphasis> rule may not be finished, so first the parser must check whether the next token matches the next element of the <emphasis>sumExpr</emphasis> rule.
      </para>

      <programlisting>
sumExpr  : prodExpr ((PLUS^|MINUS^) prodExpr)* ;
      </programlisting>

      <para>
        The next element of the <emphasis>sumExpr</emphasis> rule is <emphasis role="strong">PLUS</emphasis> or <emphasis role="strong">MINUS</emphasis>, and the next token in the expression being matched is <emphasis role="strong">MINUS</emphasis>, hence the <emphasis role="strong">MINUS</emphasis> in <emphasis>sumExpr</emphasis> is matched. There cannot be any ambiguity here because <emphasis>signExpr</emphasis> is not an alternative at this point.  There is no rule of the form (<emphasis>prodExpr</emphasis> <emphasis>signExpr</emphasis>), so <emphasis>sumExpr</emphasis> is the only possible route; if <emphasis>sumExpr</emphasis> is not matched, <emphasis>expr</emphasis> will not be matched (because <emphasis>expr</emphasis> would be expecting <emphasis role="strong">RPAREN</emphasis>) and the expression will fail.  The parser will finish checking whether or not this rule matches before doing anything else anyway, so even if there were a (<emphasis>prodExpr</emphasis> <emphasis>signExpr</emphasis>) rule (occurring later in the parser), <emphasis>sumExpr</emphasis> would have to fail for that to be checked.  There is no other rule, so the parser does match the <emphasis role="strong">MINUS</emphasis> in <emphasis>sumExpr</emphasis> and the matching of <emphasis>sumExpr</emphasis> goes on as expected.
      </para>

      <para>
        The parser has just matched the <emphasis role="strong">MINUS</emphasis> in <emphasis>sumExpr</emphasis>; it now tries to match <emphasis>prodExpr</emphasis>, which is the third component of <emphasis>sumExpr</emphasis>. This <emphasis>prodExpr</emphasis> is matched in exactly the same manner as the first. This means that <emphasis>sumExpr</emphasis> has been matched, so the parser has matched the second component of <emphasis>expr</emphasis>. The parser now expects <emphasis role="strong">RPAREN</emphasis> from the lexer, which it gets, and <emphasis>expr</emphasis> is successfully matched. Here is a <ulink url="files/images/sillyastthing.png">ridiculously over the top illustration of this process</ulink>.
      </para>
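
      <para>
        The walk through the rules described above is essentially what a hand-written recursive descent parser does. The following Java class is a minimal sketch under stated assumptions, not the parser that ANTLR generates: the class name <emphasis>SignExprSketch</emphasis> and all of its methods are hypothetical, it works directly on characters instead of lexer tokens, it does not skip whitespace, and it evaluates the expression on the fly instead of building an AST. Each method corresponds to one rule of the grammar, so it shows how the position of a <emphasis role="strong">MINUS</emphasis> decides whether it is treated as a sign or as a subtraction.
      </para>

      <programlisting><![CDATA[
```java
public class SignExprSketch {
    private final String src;
    private int pos = 0;

    SignExprSketch(String src) { this.src = src; }

    // Returns the current character, or '\0' at the end of the input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalStateException("expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    // expr : LPAREN sumExpr RPAREN
    double expr() {
        expect('(');
        double v = sumExpr();
        expect(')');
        return v;
    }

    // sumExpr : prodExpr ((PLUS|MINUS) prodExpr)*
    double sumExpr() {
        double v = prodExpr();
        while (peek() == '+' || peek() == '-') {   // here the operator is dyadic
            char op = src.charAt(pos++);
            double rhs = prodExpr();
            v = (op == '+') ? v + rhs : v - rhs;
        }
        return v;
    }

    // prodExpr : powExpr ((MUL|DIV|MOD) powExpr)*
    double prodExpr() {
        double v = powExpr();
        while (peek() == '*' || peek() == '/' || peek() == '%') {
            char op = src.charAt(pos++);
            double rhs = powExpr();
            v = (op == '*') ? v * rhs : (op == '/') ? v / rhs : v % rhs;
        }
        return v;
    }

    // powExpr : signExpr (POW signExpr)?
    double powExpr() {
        double v = signExpr();
        if (peek() == '^') { pos++; v = Math.pow(v, signExpr()); }
        return v;
    }

    // signExpr : (MINUS | PLUS)? atom
    double signExpr() {
        if (peek() == '-') { pos++; return -1 * atom(); }   // monadic minus
        if (peek() == '+') {                                 // monadic plus, as defined in this tutorial
            pos++;
            double a = atom();
            return a < 0 ? 0 - a : a;
        }
        return atom();
    }

    // atom : INT | expr
    double atom() {
        if (peek() == '(') return expr();
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) {
            throw new IllegalStateException("expected INT or '(' at position " + pos);
        }
        return Integer.parseInt(src.substring(start, pos));
    }

    static double evaluate(String s) {
        SignExprSketch p = new SignExprSketch(s);
        double v = p.expr();
        if (p.pos != s.length()) throw new IllegalStateException("unexpected trailing input");
        return v;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(-3--2)"));   // prints -1.0
    }
}
```
]]></programlisting>

      <para>
        A <emphasis role="strong">MINUS</emphasis> seen at the start of <emphasis>signExpr</emphasis> can only be a sign, while a <emphasis role="strong">MINUS</emphasis> seen after a complete <emphasis>prodExpr</emphasis> inside <emphasis>sumExpr</emphasis> can only be a subtraction, so no extra lookahead or predicate is needed.
      </para>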

      <para>
        Hopefully I have not obfuscated the working of the <emphasis>signExpr</emphasis> rule. Let's take a closer look at it:
      </para>

      <programlisting>
signExpr : (
         m:MINUS^ {#m.setType(SIGN_MINUS);}
         | p:PLUS^  {#p.setType(SIGN_PLUS);}
         )? atom ;        
      </programlisting>

      <para>
        First of all, note that the first part of <emphasis>signExpr</emphasis> is enclosed within parentheses postfixed with a question mark, indicating that a <emphasis>signExpr</emphasis> can be either an <emphasis>atom</emphasis> or a sign symbol followed by an <emphasis>atom</emphasis>.  The rule says that if a <emphasis role="strong">MINUS</emphasis> or <emphasis role="strong">PLUS</emphasis> is encountered, the <emphasis role="strong">MINUS</emphasis> or <emphasis role="strong">PLUS</emphasis> should become the root of a new subtree with <emphasis>atom</emphasis> as the only child.  This is exactly the desired behaviour: because the <emphasis>atom</emphasis> is a child, the tree parser can recognise the root and perform some action on this child.  This is where the imaginary tokens come in:
      </para>

      <programlisting>
        m:MINUS^
      </programlisting>

      <para>
        Assigns the root node <emphasis role="strong">MINUS</emphasis> to the label '<emphasis role="strong">m</emphasis>' so that the following action can refer to it.
      </para>

      <programlisting>
         {#m.setType(SIGN_MINUS);}
      </programlisting>

      <para>
        Changes the type of this root node from <emphasis role="strong">MINUS</emphasis> to <emphasis role="strong">SIGN_MINUS</emphasis>. This is done because the tree parser already uses a <emphasis role="strong">MINUS</emphasis> root to recognise that it should perform a dyadic subtraction operation:
      </para>

      <programlisting>
  | #(MINUS a=expr b=expr) { r=a-b; }
      </programlisting>

      <para> 
        To avoid needing some kind of syntactic predicate to determine which kind of <emphasis role="strong">MINUS</emphasis> operation to perform, another token, <emphasis role="strong">SIGN_MINUS</emphasis>, is created to represent the monadic <emphasis role="strong">MINUS</emphasis> operation.  The same thing is done for the <emphasis role="strong">PLUS</emphasis> operator:
      </para>

      <programlisting>
imaginaryTokenDefinitions :
   SIGN_MINUS
   SIGN_PLUS
;
      </programlisting>

      <para>
        When parsing the AST, the tree parser knows exactly which kind of operation to perform. The full tree parser section is shown below:
      </para>

      <programlisting>
{import java.lang.Math;}
class ExpressionTreeWalker extends TreeParser;

expr returns [double r]
  { double a,b; r=0; }

  : #(PLUS  a=expr b=expr) { r=a+b; }
  | #(MINUS a=expr b=expr) { r=a-b; }
  | #(MUL   a=expr b=expr) { r=a*b; }
  | #(DIV   a=expr b=expr) { r=a/b; }
  | #(MOD   a=expr b=expr) { r=a%b; }
  | #(POW   a=expr b=expr) { r=Math.pow(a,b); }
  | #(LPAREN a=expr)       { r=a; }
  | #(SIGN_MINUS a=expr)   { r=-1*a; } 
  | #(SIGN_PLUS  a=expr)   { if(a&lt;0)r=0-a; else r=a; }
  | i:INT { r=(double)Integer.parseInt(i.getText()); }
  ;        
      </programlisting>

      <para>
        If a <emphasis role="strong">SIGN_MINUS</emphasis> root is encountered, the desired consequence is to negate the operand, which is achieved by multiplying it by -1.  If a <emphasis role="strong">SIGN_PLUS</emphasis> root is encountered, the desired consequence is to do nothing if the operand is already positive and to make the operand positive if it is negative. This is achieved with a conditional statement which subtracts the operand from zero if it is negative and leaves it unchanged otherwise. The whole grammar can be found here: <ulink url="files/expression3.g">expression3.g</ulink>. The <emphasis>Main</emphasis> method used to run it is the same as for the original expression evaluator.
      </para>

      <note>
         <para>
            As pointed out to me by Safak Oekmen, this interpretation of <emphasis role="strong">SIGN_PLUS</emphasis> is quite strange; usually one would not assume that a positive sign prefix would change the sign of a following negative number to positive. However, this doesn't affect the pedagogical implications of the example if <emphasis role="strong">SIGN_PLUS</emphasis> takes on the indicated role so it will be left as is.
         </para>
      </note>
      <para>
        Let's look at an example run-through of the expression <emphasis role="strong">(-3--2)</emphasis>. It is assumed that ANTLR has been run on the grammar, all classes have been compiled and the expression has been fed to the lexer via <emphasis role="strong">Main</emphasis>.  Here is the string list and the value printed:
      </para>

      <screen>
 ( ( ( - ( - 3 ) ( - 2 ) ) )
Value: -1.0
      </screen>

      <para>Here is the AST produced:</para>

      <figure><title>(-3--2) AST</title>
        <mediaobject>
          <imageobject><imagedata fileref="files/images/minusminus3minus2AST.png" format="PNG"/></imageobject>
        </mediaobject>
      </figure>

      <para>Here are some more AST's:</para>

      <figure><title>A few AST's</title>
        <mediaobject>
          <imageobject><imagedata fileref="files/images/a-few-asts.png" format="PNG"/></imageobject>
        </mediaobject>
      </figure>

      <note>
        <para>In the parser for the expression evaluator, a top level expression was specified as:</para>
        <programlisting>expr     : LPAREN^ sumExpr RPAREN! ;</programlisting>
        <para>If we were reading the expression from a file, this rule should actually be:</para>
        <programlisting>expr     : LPAREN^ sumExpr RPAREN! EOF! ;</programlisting>
        <para>This ensures that the EOF (End Of File) token is matched and hence that the whole rule is matched. As it happens, the parser will match as much as it can anyway, so it matches enough to build the AST. Including EOF in the rule would not allow us to enter expressions at the command line unless one had a way to specify the EOF character. Bear this note in mind when constructing parsers that read from files.
        </para>
      </note>
    </sect2>
  </sect1>

  <sect1 id="ANTLR-Translation-Example"><title>A Translation Example - CSV to XHTML Table</title>
    <para>
      A lexer translates from a stream of characters to a stream of tokens. A parser recognises certain sequences of tokens and performs actions based on this recognition, such as converting the sequence to a sequence of machine language instructions to be executed later, or generating an AST for later traversal. The more usual of the two is to first generate an AST, which is another translation, and then to parse this AST itself to generate some other form of output.
      Compilation can be seen as a series of translations. This section illustrates the idea of translation with a simple ANTLR example that translates a comma-separated values (CSV) file to an XHTML table.
    </para>

    <para>
      A CSV file is a very simple kind of data structure: values separated by commas and newlines to create a kind of table. An example will clarify this:
    </para>

    <programlisting>
&quot;STUDENT ID, &quot;NAME,            &quot;DATE OF BIRTH
&quot;129384,     &quot;Davy  Jones,     &quot;03/04/81
&quot;328649,     &quot;Clare Manstead,  &quot;30/11/81
&quot;237090,     &quot;Richard Stoppit, &quot;22/06/82
&quot;523343,     &quot;Brian Hardwick,  &quot;15/11/81
&quot;908423,     &quot;Sally Brush,     &quot;06/06/81
&quot;328453,     &quot;Elisa Strudel,   &quot;12/09/82
&quot;883632,     &quot;Peter Smith,     &quot;03/05/83
&quot;542033,     &quot;Ryan Alcot,      &quot;04/12/80
    </programlisting>

    <para>
      The translator developed in this section will translate CSV files of this type. The structure of the CSV file will be discussed whilst simultaneously developing the parser. This will illustrate the very first translation: that of translating from general ideas about the structure of a file to an ANTLR parser capable of recognising this structure.
    </para>

    <para>The CSV <emphasis>file</emphasis> is composed of one or more <emphasis>line</emphasis>s and terminates with an EOF token:</para>

    <programlisting>
file   : ( line (NEWLINE line)* (NEWLINE)? EOF ) ;
    </programlisting>

    <para>
      A <emphasis>line</emphasis> consists of one or more <emphasis>record</emphasis>s; <emphasis role="strong">NEWLINE</emphasis> is handled by <emphasis>file</emphasis>.
    </para>

    <programlisting>line   : ( (record)+ ) ;</programlisting>

    <para>
      A <emphasis>record</emphasis> consists of a <emphasis role="strong">RECORD</emphasis> token, optionally followed by a <emphasis role="strong">COMMA</emphasis> token (because records at the end of a line or file are not followed by a <emphasis role="strong">COMMA</emphasis>):
    </para>

    <programlisting>record : ( (r:RECORD) (COMMA)? ) ;</programlisting>

    <para>
      Notice that the last line of the CSV file is an exceptional case because it may terminate with <emphasis role="strong">EOF</emphasis> instead of <emphasis role="strong">NEWLINE</emphasis>. This is handled by the <emphasis>file</emphasis> rule, which says that a file begins with a line, has zero or more (<emphasis role="strong">NEWLINE</emphasis> line) sequences, then has an optional <emphasis role="strong">NEWLINE</emphasis> and finally terminates with an <emphasis role="strong">EOF</emphasis> token.
    </para>

    <programlisting>
class CSVParser extends Parser;
file   : ( line (NEWLINE line)* (NEWLINE)? EOF ) ;
line   : ( (record)+ ) ;
record : ( (r:RECORD) (COMMA)? ) ;
    </programlisting>

    <para>It is coupled with this lexer:</para>

    <programlisting>
class CSVLexer extends Lexer;
options { charVocabulary='\3'..'\377'; }
RECORD  : '&quot;'! (~(','|'\r'|'\n'))+ ;
COMMA   : ',' ;
NEWLINE : ( ('\r''\n')=&gt; '\r''\n' //DOS
          | '\r'                  //MAC
          | '\n'                  //UNIX
          )
          { newline(); }          // grouped so the line count is updated for every alternative
        ;
WS      : (' '|'\t') { $setType(Token.SKIP); } ;
    </programlisting>

    <para>
      First of all, <emphasis>charVocabulary</emphasis> is set to <emphasis role="strong">'\3'..'\377'</emphasis>. This defines the set of characters that characters in the input stream must belong to. It also implicitly defines which characters will be used as alternatives when an &quot;everything except&quot; kind of rule is specified.
    </para>

    <para>
      <emphasis role="strong">RECORD</emphasis> is an example of an &quot;everything except&quot; rule and says that a <emphasis role="strong">RECORD</emphasis> consists of one or more characters in the defined character range, except comma, carriage-return or newline. Notice that <emphasis role="strong">RECORD</emphasis> was defined as starting with a double quote character. This format was chosen so that the user could more easily include records that contain spaces.  If this double quote were not used to signify the start of a record, the record would consist of all characters from the start of the record up to the next comma, and this would include spaces. If a nicely aligned CSV like:
    </para>

    <literallayout>
      Dave,    21
      Richard, 55
      Peter,   98
    </literallayout>

    <para>
      was used, then the records would be &quot;Dave&quot;, <literallayout>&quot;    21&quot;</literallayout>, &quot;Richard&quot;, &quot; 55&quot;, &quot;Peter&quot; and <literallayout>&quot;   98&quot;</literallayout>, which is probably not what is desired.  Of course, one could process the tokens afterward to strip leading and trailing spaces, but then what if some of the tokens <emphasis>should</emphasis> contain leading or trailing spaces? It was decided that in this example the records would be defined in this way. You could play around with these design issues yourself as a learning experience. One more thing to note is that the double quote is postfixed with an exclamation mark so that the double quote itself is not included in the actual text of the token. The rest of the lexer rules are self-explanatory.
    </para>
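
    <para>
      To make the behaviour of the lexer rules more concrete, here is a minimal hand-written Java sketch that imitates them. It is not the lexer that ANTLR generates: the class name <emphasis>CsvLexerSketch</emphasis>, the <emphasis>tokens</emphasis> method and the textual form of the tokens are all assumptions made for illustration. It shows that whitespace between tokens is skipped, that the leading double quote is dropped from the text of a record, and that a record runs up to the next comma, carriage-return or newline.
    </para>

    <programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

public class CsvLexerSketch {
    // Turns the input into a list of token descriptions such as "RECORD(89)" or "COMMA".
    static List<String> tokens(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == ' ' || c == '\t') {
                i++;                                   // WS: skipped, no token produced
            } else if (c == ',') {
                out.add("COMMA");
                i++;
            } else if (c == '\r') {                    // NEWLINE: DOS "\r\n" or MAC "\r"
                i += (i + 1 < in.length() && in.charAt(i + 1) == '\n') ? 2 : 1;
                out.add("NEWLINE");
            } else if (c == '\n') {                    // NEWLINE: UNIX "\n"
                out.add("NEWLINE");
                i++;
            } else if (c == '"') {                     // RECORD: the quote itself is not kept
                int start = ++i;
                while (i < in.length() && ",\r\n".indexOf(in.charAt(i)) < 0) i++;
                if (i == start) {
                    throw new IllegalArgumentException("empty record at offset " + start);
                }
                out.add("RECORD(" + in.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException(
                        "unexpected character '" + c + "' at offset " + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [RECORD(David Sprocket), COMMA, RECORD(89), NEWLINE]
        System.out.println(tokens("\"David Sprocket, \"89\n"));
    }
}
```
]]></programlisting>

    <para>
      Note that the space inside &quot;David Sprocket&quot; is kept because it occurs after the opening double quote, whereas the space after the comma is skipped because it occurs between tokens.
    </para>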

    <para>
      If this lexer and parser were used to process a CSV file, nothing visible would happen: the rules would be matched correctly and the program would terminate with an exit status of zero. For illustration, I added some print statements to the parser so that it prints when a rule is called, when it is matched and what the record was:
    </para>

    <programlisting>
class CSVParser extends Parser;
options { k=2; }
file   {System.out.println(&quot;file called&quot;);}
       : ( line (NEWLINE line)* (NEWLINE)? EOF)
       {System.out.println(&quot;file matched&quot;);}
       ;

line   {System.out.println(&quot;line called&quot;);}
       : ( (record)+ )
       {System.out.println(&quot;line matched&quot;);}
       ;

record {System.out.println(&quot;record called&quot;);}
       : ( (r:RECORD) (COMMA)? )
       {System.out.println(&quot;record = &quot; + r.getText());
        System.out.println(&quot;record matched&quot;);}
       ;
    </programlisting>

    <para>Notice the lookahead of two to distinguish between (<emphasis role="strong">NEWLINE</emphasis> line) and (<emphasis role="strong">NEWLINE</emphasis>) <emphasis role="strong">EOF</emphasis>. The parser was run (via a Main method explained later) on the following file:</para>

    <programlisting>
&quot;David Sprocket, &quot;89
&quot;Cindy Brocket,  &quot;18
&quot;Michael Rocket, &quot;33
    </programlisting>

    <para>The output that was produced is shown below (reformatted by hand to make it clearer):</para>

    <programlisting>
file called
  line called
    record called
      record = David Sprocket
    record matched

    record called
      record = 89
    record matched
  line matched

  line called
    record called
      record = Cindy Brocket
    record matched

    record called
      record = 18
    record matched
  line matched

  line called
    record called
      record = Michael Rocket
    record matched

    record called
      record = 33
    record matched
  line matched
file matched
    </programlisting>

    <para>
      The ANTLR grammar so far can be downloaded here: <ulink url="files/translate/csv1.g">csv1.g</ulink>. It is clear from the output that the parser is performing as it should. To produce an HTML table, one only has to change the output statements so that instead of outputting &quot;file called&quot;, for example, the parser outputs &quot;&lt;table&gt;&quot;, or whatever the <acronym>HTML</acronym> equivalent should be. The parser, modified to output HTML statements, is shown below:
    </para>

    <programlisting>
class CSVParser extends Parser;
options { k=2; }
file   {System.out.println(&quot;&lt;table align=\&quot;center\&quot; border=\&quot;1\&quot;&gt;&quot;);}
       : ( line (NEWLINE line)* (NEWLINE)? EOF)
       {System.out.println(&quot;&lt;/table&gt;&quot;);}
       ;

line   {System.out.println(&quot;  &lt;tr&gt;&quot;);}
       : ( (record)+ )
       {System.out.println(&quot;  &lt;/tr&gt;&quot;);}
       ;

record {System.out.print(&quot;    &lt;td&gt;&quot;);}
       : ( (r:RECORD) (COMMA)? )
       {System.out.print(r.getText());
        System.out.println(&quot;&lt;/td&gt;&quot;);}
       ;
    </programlisting>

    <para>The output produced when processing the same file as before is:</para>

    <screen>
&lt;table align=&quot;center&quot; border=&quot;1&quot;&gt;
  &lt;tr&gt;
    &lt;td&gt;David Sprocket&lt;/td&gt;
    &lt;td&gt;89&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Cindy Brocket&lt;/td&gt;
    &lt;td&gt;18&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Michael Rocket&lt;/td&gt;
    &lt;td&gt;33&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
    </screen>

    <para>
      The table produced by the parser is what was intended of the translator. Apart from adjusting how the output should look, all that is needed now is to output the rest of the HTML file. Generation of the rest of the <acronym>HTML</acronym> file could be done in the parser by adding more to the <emphasis>file</emphasis> rule, but this would clutter the parser, so let's delegate this task to the class that uses this parser and lexer.  A <emphasis>main</emphasis> method could be placed in the parser part of the grammar so that, when the parser is generated, it has a <emphasis>main</emphasis> method that can be executed. Alternatively, a separate class could be created that has a <emphasis>main</emphasis> method which implements the translator. This example will use a separate class. Here is that separate class:
    </para>

    <programlisting>
import java.io.*;
public class Main {
   public static void main(String args[]) {
      if(args.length==0) { error(); }

      FileInputStream fileInput = null;
      try { 
         fileInput = new FileInputStream(args[0]);
      } catch(Exception e) { error(); }

      try {
         DataInputStream input = new DataInputStream(fileInput);

         CSVLexer csvLexer =   new CSVLexer(input);
         CSVParser csvParser = new CSVParser(csvLexer);
         csvParser.file();
      } catch(Exception e) { System.err.println(e.getMessage()); }
   }

   private static void error() {
      System.out.println(&quot;*-----------------------*&quot;);
      System.out.println(&quot;| USAGE:                |&quot;);
      System.out.println(&quot;|   java Main inputfile |&quot;);
      System.out.println(&quot;*-----------------------*&quot;);
      System.exit(1);   // a non-zero status signals that the run failed
   }
}
    </programlisting>

    <para>
      The program first checks that the one compulsory command-line argument has been provided. If it has not, the program prints a usage message and exits. If it has, the program creates a new <emphasis>FileInputStream</emphasis> from the file specified; if this fails, the usage message is printed and the program exits. Otherwise the program enters another <emphasis>try</emphasis> block, where a <emphasis>DataInputStream</emphasis> is set up on the file, the lexer is created with this stream, the parser is created with this lexer, and the parser's <emphasis>file</emphasis> method is called. If an <emphasis>Exception</emphasis> is thrown, it is caught and its message is printed.
    </para>

    <para>
      <orderedlist>
        <listitem>
          <para>
            The Main program used to run the translator can be downloaded here: <ulink url="files/translate/Main.java">Main.java</ulink>.
          </para>
        </listitem>
        <listitem>
          <para>
            The full grammar file can be downloaded here: <ulink url="files/translate/csv2.g">csv2.g</ulink>.
          </para>
        </listitem>
        <listitem>
          <para>
            Some test data can be downloaded here: <ulink url="files/translate/test.txt">test.txt</ulink>
          </para>
        </listitem>
      </orderedlist>
    </para>
    
    <para>
      At the moment the parser writes to <emphasis>stdout</emphasis>, which is not very useful. It would be more useful if the parser returned the <acronym>HTML</acronym> output as a string, so that the calling class could do whatever it wanted with it, such as writing it to a file. The parser must be modified to return a string:
    </para>

    <programlisting>
class CSVParser extends Parser;
options { k=2; }
file returns[String table = new String()] 
       {String lineData; table+=&quot;&lt;table align=\&quot;center\&quot; border=\&quot;1\&quot;&gt;\n&quot;; }
       : ( lineData=line {table+=lineData;} 
            (NEWLINE lineData=line {table+=lineData;} )*
               (NEWLINE)? EOF )
       {table+=&quot;&lt;/table&gt;&quot;;}
       ;

line returns [String lineData = new String()] 
       {String recordData; lineData+=&quot;  &lt;tr&gt;\n&quot;;}
       : ( (recordData=record {lineData+=recordData;})+ )
       {lineData+=&quot;  &lt;/tr&gt;\n&quot;;}
       ;

record returns [String recordData = new String()]
       {recordData+=(&quot;    &lt;td&gt;&quot;);}
       : ( (rec:RECORD) (COMMA)? )
       {recordData+=(rec.getText()); 
        recordData+=&quot;&lt;/td&gt;\n&quot;;}
       ;
    </programlisting>

    <para>The full grammar can be downloaded here: <ulink url="files/translate/csv3.g">csv3.g</ulink>.</para>

    <para>
      Within <emphasis>file</emphasis>: the opening <emphasis>table</emphasis> element is immediately appended to the String <emphasis role="strong">table</emphasis> defined within the <emphasis>file</emphasis> rule. The first line is matched, and the value returned is assigned to the variable <emphasis role="strong">lineData</emphasis>, which is appended to <emphasis role="strong">table</emphasis>. Zero or more further lines are then matched; each time, the value returned from the call to <emphasis>line</emphasis> is assigned to <emphasis role="strong">lineData</emphasis> and appended to <emphasis role="strong">table</emphasis>.
    </para>

    <programlisting>
line returns [String lineData = new String()] 
       {String recordData; lineData+=&quot;  &lt;tr&gt;\n&quot;;}
       : ( (recordData=record {lineData+=recordData;})+ )
       {lineData+=&quot;  &lt;/tr&gt;\n&quot;;}
       ;
    </programlisting>

    <para>
      Within <emphasis>line</emphasis>: the opening <emphasis>&lt;tr&gt;</emphasis> element is immediately appended to the <emphasis role="strong">lineData</emphasis> variable. One or more records are then matched, and each time the value returned from the call to <emphasis>record</emphasis> is assigned to the variable <emphasis role="strong">recordData</emphasis>, which is then appended to <emphasis role="strong">lineData</emphasis>.
    </para>

    <programlisting>
record returns [String recordData = new String()]
       {recordData+=(&quot;    &lt;td&gt;&quot;);}
       : ( (rec:RECORD) (COMMA)? )
       {recordData+=(rec.getText()); 
        recordData+=&quot;&lt;/td&gt;\n&quot;;}
       ;
    </programlisting>

    <para>
      Within <emphasis>record</emphasis>: the opening <emphasis>&lt;td&gt;</emphasis> element is immediately appended to the string <emphasis role="strong">recordData</emphasis>. A <emphasis role="strong">RECORD</emphasis> token is matched and assigned to the <emphasis role="strong">rec</emphasis> variable; its textual content is then appended to <emphasis role="strong">recordData</emphasis>, followed by the closing <emphasis>&lt;/td&gt;</emphasis>. The completed record, in the form of <emphasis role="strong">recordData</emphasis>, is returned to <emphasis>line</emphasis>.
    </para>

    <para>
      <emphasis>line</emphasis> receives <emphasis role="strong">recordData</emphasis> and uses it in the construction of <emphasis role="strong">lineData</emphasis>. When all the records for a line have been matched, the closing <emphasis>&lt;/tr&gt;</emphasis> is appended to <emphasis role="strong">lineData</emphasis> and <emphasis role="strong">lineData</emphasis> is returned.
    </para>

    <para>
      <emphasis>file</emphasis> receives <emphasis role="strong">lineData</emphasis> and uses it in the construction of <emphasis role="strong">table</emphasis>. When all the lines in the file have been matched, the closing <emphasis>&lt;/table&gt;</emphasis> tag is appended to <emphasis role="strong">table</emphasis> and <emphasis role="strong">table</emphasis> is returned to the class that instantiated the parser.
    </para>
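
    <para>
      The way the three rules hand strings upwards can be mimicked without ANTLR. The following is only a minimal sketch in plain Java: the class <emphasis>TableSketch</emphasis> and its methods are hypothetical stand-ins for the generated parser methods, and the input is assumed to have been split into fields already.
    </para>

    <programlisting>
// Hypothetical stand-in for the generated parser: plain Java methods that
// mirror the file, line and record rules and return Strings the same way.
final class TableSketch {

   // Mirrors the record rule: wraps one field in a td element.
   static String record(String text) {
      String recordData = &quot;    &lt;td&gt;&quot;;
      recordData += text;
      recordData += &quot;&lt;/td&gt;\n&quot;;
      return recordData;
   }

   // Mirrors the line rule: one or more records inside a tr element.
   static String line(String[] fields) {
      String lineData = &quot;  &lt;tr&gt;\n&quot;;
      for (String field : fields) {
         lineData += record(field);
      }
      lineData += &quot;  &lt;/tr&gt;\n&quot;;
      return lineData;
   }

   // Mirrors the file rule: every line inside a single table element.
   static String file(String[][] lines) {
      String table = &quot;&lt;table align=\&quot;center\&quot; border=\&quot;1\&quot;&gt;\n&quot;;
      for (String[] fields : lines) {
         table += line(fields);
      }
      table += &quot;&lt;/table&gt;&quot;;
      return table;
   }

   public static void main(String[] args) {
      String[][] rows = { { &quot;David Sprocket&quot;, &quot;89&quot; }, { &quot;Cindy Brocket&quot;, &quot;18&quot; } };
      System.out.println(file(rows));
   }
}
    </programlisting>

    <para>
      Each method owns exactly one element and knows nothing about its caller, which is the same division of labour the grammar rules have.
    </para>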

    <para>Here is the new class containing the <emphasis>main</emphasis> method:</para>

    <programlisting>
import java.io.*;
public class CSVHTML {
   public static void main(String args[]) {
      if(args.length!=2) { error(); }

      FileInputStream fileInput = null;
      DataOutputStream fileOutput = null;
      try { 
         fileInput = new FileInputStream(args[0]);
         fileOutput = new DataOutputStream(new FileOutputStream(args[1]));
      } catch(Exception e) { error2(); }

      try {
         DataInputStream input = new DataInputStream(fileInput);

         CSVLexer csvLexer =   new CSVLexer(input);
         CSVParser csvParser = new CSVParser(csvLexer);
         String p =&quot;&quot;;
         p =&quot;&lt;?xml version=\&quot;1.0\&quot; encoding=\&quot;utf-8\&quot;?&gt;\n&quot;;
         p+=&quot;&lt;!DOCTYPE html\n&quot;;
         p+=&quot;PUBLIC \&quot;-//W3C//DTD XHTML 1.0 Transitional//EN\&quot;\n&quot;;
         p+=&quot;\&quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\&quot;&gt;\n&quot;;
         p+=&quot;&lt;html&gt;\n&quot;;
         p+=&quot;  &lt;head&gt;\n&quot;;
         p+=&quot;    &lt;title&gt;A Table&lt;/title&gt;\n&quot;;
         p+=&quot;     &lt;meta http-equiv=\&quot;Content-Type\&quot; content=\&quot;text/html; charset=utf-8\&quot;/&gt;\n&quot;;
         p+=&quot;  &lt;/head&gt;\n&quot;;
         p+=&quot;  &lt;body&gt;\n&quot;;
         p+= csvParser.file();
         p+=&quot;  &lt;/body&gt;\n&quot;;
         p+=&quot;&lt;/html&gt;&quot;;
         fileOutput.writeBytes(p);
         fileOutput.close();
      } catch(Exception e) { System.err.println(e.getMessage()); }
   }

   private static void error() {
      System.out.println(&quot;*-------------------------------------*&quot;);
      System.out.println(&quot;| USAGE:                              |&quot;);
      System.out.println(&quot;|   java CSVHTML inputfile outputfile |&quot;);
      System.out.println(&quot;*-------------------------------------*&quot;);
      System.exit(1);   // a non-zero status signals that the run failed
   }

   private static void error2() {
      System.out.println(&quot;*------------------------------------*&quot;);
      System.out.println(&quot;| You must specify a valid inputfile |&quot;);
      System.out.println(&quot;*------------------------------------*&quot;);
      System.exit(1);   // a non-zero status signals that the run failed
   }
}
    </programlisting>

    <para>
      The program shown above can be downloaded here: <ulink url="files/translate/CSVHTML.java">CSVHTML.java</ulink>. The program works like this: it first checks that both command-line arguments are present. If they are, a <emphasis>FileInputStream</emphasis> is created for the input file named by the first argument and a <emphasis>DataOutputStream</emphasis> for the output file named by the second; the input stream is then wrapped in a <emphasis>DataInputStream</emphasis>. The lexer and parser are created. A string is created and the <acronym>HTML</acronym> prologue appended. The string returned from calling the parser's <emphasis>file</emphasis> method is appended to this string. Finally, the closing <acronym>HTML</acronym> tags are appended and the string is written to the output file.
    </para>
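
    <para>
      One detail of the output step is worth checking: <emphasis>DataOutputStream.writeBytes</emphasis> keeps only the low eight bits of each character, whereas the generated page declares <emphasis>utf-8</emphasis>. For plain ASCII data this makes no difference. The sketch below is not part of the downloadable program; it shows one way the final write could go through an <emphasis>OutputStreamWriter</emphasis> instead. The class name <emphasis>Utf8WriteSketch</emphasis>, the method <emphasis>write</emphasis> and the default file name are invented for illustration.
    </para>

    <programlisting>
import java.io.*;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CSVHTML.java.
final class Utf8WriteSketch {

   // Writes the page using the encoding that its prologue declares.
   static void write(String page, File target) throws IOException {
      try (Writer out = new OutputStreamWriter(
               new FileOutputStream(target), StandardCharsets.UTF_8)) {
         out.write(page);
      }
   }

   public static void main(String[] args) throws IOException {
      // The output file name is an assumption made for this sketch.
      String name = args.length &gt; 0 ? args[0] : &quot;sketch-output.html&quot;;
      write(&quot;&lt;td&gt;Caf\u00e9&lt;/td&gt;\n&quot;, new File(name));
   }
}
    </programlisting>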

    <para>
      After ANTLR has generated the lexer and parser from the grammar and everything has been compiled, the program can be executed like this:
    </para>

    <screen>
      <userinput><command>java</command> CSVHTML <replaceable>inputfile</replaceable> <replaceable>outputfile</replaceable></userinput>
    </screen>

    <para>The table produced from executing the command:</para>

    <programlisting>
      <userinput><command>java</command> CSVHTML <filename>test.txt</filename> <filename>output.html</filename></userinput>
    </programlisting>

    <para>Looks, under a certain proprietary web-browser, like this:</para>

    <figure><title><filename>test.txt</filename> expressed as a table</title>
      <mediaobject>
        <imageobject><imagedata fileref="files/images/testtextastable.png" format="PNG"/></imageobject>
      </mediaobject>
    </figure>
  </sect1>

  <sect1 id="ANTLR-Translation-Example-Behind-The-Scenes"><title>Snippets From Behind The Scenes</title>
    <para>
      When the lexer tokenizes the input stream, each token encountered is categorized by type, such as a <emphasis role="strong">NEWLINE</emphasis> token. A table of these token types is created and each token type is represented by an integer. The integers 1-3 are special in that they denote predefined token types; user-defined tokens are assigned integers starting from 4. The integers are mapped to human-readable identifiers in a token types file generated by ANTLR, for example:
    </para>

    <programlisting>
// $ANTLR 2.7.1: &quot;csv.g&quot; -&gt; &quot;CSVParser.java&quot;$

public interface CSVParserTokenTypes {
  int EOF = 1;
  int NULL_TREE_LOOKAHEAD = 3;
  int NEWLINE = 4;
  int RECORD = 5;
  int COMMA = 6;
  int WS = 7;
}
    </programlisting>
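
    <para>
      When debugging it can help to turn these integers back into names. The following is a minimal sketch only: the class <emphasis>TokenNameSketch</emphasis> and its <emphasis>nameOf</emphasis> method are hypothetical, and the constants are simply copied from the interface shown above so that the example compiles on its own.
    </para>

    <programlisting>
// Hypothetical helper; the values are copied from CSVParserTokenTypes above.
final class TokenNameSketch {
   static final int EOF = 1;
   static final int NULL_TREE_LOOKAHEAD = 3;
   static final int NEWLINE = 4;
   static final int RECORD = 5;
   static final int COMMA = 6;
   static final int WS = 7;

   // Maps a token type integer back to its human-readable identifier.
   static String nameOf(int type) {
      switch (type) {
         case EOF:                 return &quot;EOF&quot;;
         case NULL_TREE_LOOKAHEAD: return &quot;NULL_TREE_LOOKAHEAD&quot;;
         case NEWLINE:             return &quot;NEWLINE&quot;;
         case RECORD:              return &quot;RECORD&quot;;
         case COMMA:               return &quot;COMMA&quot;;
         case WS:                  return &quot;WS&quot;;
         default:                  return &quot;unknown(&quot; + type + &quot;)&quot;;
      }
   }

   public static void main(String[] args) {
      // The token types a line such as &quot;a,b&quot; followed by a newline might yield.
      int[] types = { RECORD, COMMA, RECORD, NEWLINE, EOF };
      for (int type : types) {
         System.out.println(type + &quot; = &quot; + nameOf(type));
      }
   }
}
    </programlisting>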

    <para>
      There is a class called <emphasis role="strong">TokenBuffer</emphasis> whose job is to buffer the tokens provided by the lexer. It contains a method called <emphasis role="strong">LA</emphasis>, which takes one integer parameter that selects how far ahead in the buffer to look; for example, <emphasis role="strong">LA(1)</emphasis> returns the integer type of the next token in the TokenBuffer. The parser uses a series of calls to <emphasis role="strong">LA</emphasis> to match the rules it implements, for example:
    </para>

    <programlisting>
// Note, I have cleaned this up a little, ANTLR generates things like
// { {instructions} {instructions} } which can be represented as
// { instructions instructions }
public final void file() throws RecognitionException, TokenStreamException {
   try {      // for error handling
      System.out.println(&quot;file called&quot;);
      int _cnt3=0;
      _loop3:
      do {
         if ((LA(1)==RECORD)) {
            line();
         } else {
            if ( _cnt3&gt;=1 ) { break _loop3; }
            else {throw new NoViableAltException(LT(1), getFilename());}
         }
         _cnt3++;
      } while (true);
   } catch (RecognitionException ex) {
      reportError(ex);
      consume();
      consumeUntil(_tokenSet_0);
   }
}
    </programlisting>
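
    <para>
      To make the lookahead idea concrete, here is a minimal sketch of a buffer that answers <emphasis role="strong">LA(k)</emphasis> queries without consuming anything. It is not ANTLR's TokenBuffer: the class <emphasis>LookaheadSketch</emphasis> is hypothetical, it holds all token types in an array up front rather than pulling them from a lexer on demand, and it assumes that <emphasis>k</emphasis> is at least 1.
    </para>

    <programlisting>
// Hypothetical, simplified model of lookahead; not ANTLR's TokenBuffer.
final class LookaheadSketch {
   static final int EOF = 1;      // same value as in the token types file
   private final int[] types;     // token types, all known in advance
   private int position = 0;      // index of the next unconsumed token

   LookaheadSketch(int[] types) {
      this.types = types;
   }

   // Returns the type of the k-th token ahead (k starts at 1) without
   // consuming it; past the end of the input it reports EOF.
   int LA(int k) {
      int index = position + k - 1;
      return index &gt;= types.length ? EOF : types[index];
   }

   // Moves past the current token, if there is one.
   void consume() {
      if (position != types.length) {
         position++;
      }
   }

   public static void main(String[] args) {
      LookaheadSketch buffer = new LookaheadSketch(new int[] { 5, 6, 5, 4 });
      System.out.println(buffer.LA(1));   // type of the next token
      System.out.println(buffer.LA(2));   // type of the token after that
      buffer.consume();
      System.out.println(buffer.LA(1));   // what used to be LA(2)
   }
}
    </programlisting>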

    <para>
      TokenBuffer provides <emphasis role="strong">Token</emphasis> objects via <emphasis role="strong">LT</emphasis> and their integer types via <emphasis role="strong">LA</emphasis>. TokenBuffer gets its tokens for buffering by calling the method <emphasis role="strong">nextToken</emphasis>, which is defined in the lexer. The method <emphasis role="strong">nextToken</emphasis> in the lexer generated for the CSV translator looks like this:
    </para>

    <programlisting>
// Cleaned up a little by me
public Token nextToken() throws TokenStreamException {
   Token theRetToken=null;
tryAgain:
   for (;;) {
      Token _token = null;
      int _ttype = Token.INVALID_TYPE;
      resetText();
      try {   // for char stream error handling
         try {   // for lexical error handling
            switch ( LA(1)) {

               case ',': {
                  mCOMMA(true);
                  theRetToken=_returnToken;
                  break;
               }

               case '\n':  case '\r': {
                  mNEWLINE(true);
                  theRetToken=_returnToken;
                  break;
               }

               case '\t':  case ' ': {
                  mWS(true);
                  theRetToken=_returnToken;
                  break;
               }

               default:
               if ((_tokenSet_0.member(LA(1)))) {
                  mRECORD(true);
                  theRetToken=_returnToken;
               } else {
                  if (LA(1)==EOF_CHAR) {uponEOF(); _returnToken = makeToken(Token.EOF_TYPE);}
                  else {throw new NoViableAltForCharException((char)LA(1), getFilename(), getLine());}
               }
            }
            if ( _returnToken==null ) continue tryAgain; // found SKIP token

            _ttype = _returnToken.getType();
            _ttype = testLiteralsTable(_ttype);
            _returnToken.setType(_ttype);
            return _returnToken;
         } catch (RecognitionException e) {
            throw new TokenStreamRecognitionException(e);
         }
      } catch (CharStreamException cse) {
         if ( cse instanceof CharStreamIOException ) {
            throw new TokenStreamIOException(((CharStreamIOException)cse).io);
         } else {
            throw new TokenStreamException(cse.getMessage());
         }
      }
   }
}</programlisting>

    <para>Notice the bit that says:</para>

    <programlisting>
if (LA(1)==EOF_CHAR) {uponEOF(); _returnToken = makeToken(Token.EOF_TYPE);}
else {throw new NoViableAltForCharException((char)LA(1), getFilename(), getLine());}
    </programlisting>

    <para>
      This says that if the next character is <emphasis role="strong">EOF_CHAR</emphasis>, the <emphasis>uponEOF</emphasis> method is called and a new <emphasis role="strong">EOF_TYPE</emphasis> token is assigned to <emphasis role="strong">_returnToken</emphasis>, which is returned later in this method. Any other character that no rule can start with causes a <emphasis>NoViableAltForCharException</emphasis> to be thrown.
    </para>
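
    <para>
      The overall shape of <emphasis role="strong">nextToken</emphasis>, a dispatch on the first character followed by a rule that consumes the rest of the token, can be reproduced by hand. The following is a simplified sketch rather than what ANTLR generates: the class <emphasis>DispatchSketch</emphasis> is hypothetical, it returns token types only, it treats every carriage return or line feed as its own <emphasis role="strong">NEWLINE</emphasis>, and it assumes that a <emphasis role="strong">RECORD</emphasis> runs up to the next comma or line end.
    </para>

    <programlisting>
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer; the RECORD definition is an assumption.
final class DispatchSketch {
   static final int EOF = 1, NEWLINE = 4, RECORD = 5, COMMA = 6;

   // Returns the token types found in the input; blanks and tabs between
   // tokens are skipped, much as a SKIP token would be.
   static List&lt;Integer&gt; tokenize(String input) {
      List&lt;Integer&gt; types = new ArrayList&lt;&gt;();
      int i = 0;
      while (i &lt; input.length()) {
         switch (input.charAt(i)) {
            case ',':
               types.add(COMMA);
               i++;
               break;
            case '\n': case '\r':
               types.add(NEWLINE);
               i++;
               break;
            case '\t': case ' ':
               i++;                 // whitespace produces no token
               break;
            default:
               // Consume everything up to the next comma or line end.
               while (i &lt; input.length()
                      &amp;&amp; &quot;,\n\r&quot;.indexOf(input.charAt(i)) &lt; 0) {
                  i++;
               }
               types.add(RECORD);
         }
      }
      types.add(EOF);               // end of input, as with EOF_CHAR above
      return types;
   }

   public static void main(String[] args) {
      System.out.println(tokenize(&quot;David Sprocket,89\nCindy Brocket,18&quot;));
   }
}
    </programlisting>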

    <para>
      The code below shows <emphasis>rec</emphasis> being assigned the Token returned from <emphasis role="strong">LT</emphasis>. Later, the <emphasis role="strong">getText()</emphasis> method is invoked on <emphasis>rec</emphasis> to get the token's textual content.
    </para>

    <programlisting>
   {
   rec = LT(1);
   match(RECORD);
   }
   .
   .
   .
   recordData+=(rec.getText()); 
   recordData+=&quot;&lt;/td&gt;\n&quot;;
    </programlisting>
  </sect1>

  <sect1 id="ANTLR-Thanks"><title>Thanks</title>
    <para>
      Thanks go to Bogdan Mitu for showing me the way with the translator example <emphasis>file</emphasis> rule, and to Ric Klaren for showing me how blindingly simple it was to do the nested return statements in the translator example.
    </para>
  </sect1>

  <sect1 id="ANTLR-References"><title>References (And links you may find useful)</title>
    <itemizedlist>
      <listitem>
        <para><ulink url="http://www.antlr.org/book/index.html">http://www.antlr.org/book/index.html</ulink></para>
        <para>
          <literallayout>
Practical Computer Language Recognition and Translation
A guide for building source-to-source translators with ANTLR and Java.

Copyright 1999 Terence Parr

Updated 2/1/99
          </literallayout>
        </para>
      </listitem>

      <listitem>
        <para><ulink url="http://www.antlr.org/article/list">http://www.antlr.org/article/list</ulink></para>
        <para>ANTLR articles page - lots of interesting things.</para>
      </listitem>

      <listitem>
        <para>
          <ulink url="http://www.antlr.org/doc/getting-started.html">http://www.antlr.org/doc/getting-started.html</ulink>
        </para>
        <para>
          Getting Started with <acronym>ANTLR</acronym>.
        </para>
      </listitem>

      <listitem>
        <para><ulink url="http://javadude.com/articles/antlrtut/">http://javadude.com/articles/antlrtut/</ulink></para>
        <para>An ANTLR Tutorial by Scott Stanchfield</para>
      </listitem>

      <listitem>
        <para><ulink url="http://topaz.cs.byu.edu/text/html/Textbook/">http://topaz.cs.byu.edu/text/html/Textbook/</ulink></para>
        <para>Compiler Theory And Design</para>
      </listitem>
      <listitem>
        <para>The ANTLR Reference Manual</para>
        <para>Comes included with the ANTLR installation</para>
      </listitem>
    </itemizedlist>
  </sect1>
</article>

