ANTLR C# Overview - Part 1 - Manuel Abadia's ASP.NET stuff
 
# Tuesday, December 26, 2006

This overview continues in Part 2 and Part 3.

There are some articles out there about ANTLR v3 that cover how to define grammars and tree parsers, but I haven’t seen any that give an overall view of all the classes and interfaces involved, so I’ll try to fill that gap. I’m using the C# version of ANTLR 3.0b5, as it is what I use and it has better naming conventions for interfaces and classes.

There’s an upcoming book about ANTLR by its author that will be very valuable material if you work with ANTLR.

ANTLR lets us create a lexer, a parser and a tree parser. Depending on what you are creating, the classes generated by ANTLR will inherit from the Lexer, Parser or TreeParser class. Those three classes inherit from a common abstract class called BaseRecognizer, which contains the parsing code they share.

The Lexer class is also an abstract class that extends BaseRecognizer and implements the ITokenSource interface.

The ITokenSource interface has only one method, NextToken, that is used to get all tokens from the input. For ANTLR a token is a class that implements the IToken interface:

public interface IToken
{
    int Channel { get; set; }
    int CharPositionInLine { get; set; }
    int Line { get; set; }
    string Text { get; set; }
    int TokenIndex { get; set; }
    int Type { get; set; }
}

 

The basic properties of a token are its type and its text. The Line and CharPositionInLine are used to know where the token was read from the input.

A parser reads tokens from a channel, so the Channel property is usually used to distinguish a “normal” token from white space or any other token that acts as a separator and does not alter the input to be parsed.

All tokens read from the input are assigned a unique TokenIndex (starting from 0 and incremented each time a token is read).

The CommonToken class implements IToken and it is the class that is usually used to represent a token in ANTLR. It has two properties not present in the IToken interface, StartIndex and StopIndex, that are used to indicate where the token starts and ends in the input stream. So the Text property returns the data directly from the input stream unless the Text property has been explicitly set. The ToString method prints a token like this:

[@15,23:26='else',<10>,1:22]

Here 15 is the TokenIndex, 23 and 26 are the StartIndex and StopIndex respectively, else is the Text, 10 is the Type, and 1 and 22 are the Line and CharPositionInLine respectively.
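To make that layout concrete, here is a minimal sketch that reproduces the format. SketchToken is a hypothetical stand-in I made up for illustration; it is not the CommonToken source, and the real class may differ in details (for example, how it prints non-default channels).

```csharp
using System;

// Hypothetical stand-in for a token; NOT the ANTLR CommonToken implementation.
public class SketchToken
{
    public int TokenIndex;
    public int StartIndex;
    public int StopIndex;
    public string Text;
    public int Type;
    public int Line;
    public int CharPositionInLine;

    // Follows the layout described in the text:
    // [@TokenIndex,StartIndex:StopIndex='Text',<Type>,Line:CharPositionInLine]
    public override string ToString()
    {
        return "[@" + TokenIndex + "," + StartIndex + ":" + StopIndex +
               "='" + Text + "',<" + Type + ">," +
               Line + ":" + CharPositionInLine + "]";
    }
}

public static class Program
{
    public static void Main()
    {
        SketchToken t = new SketchToken();
        t.TokenIndex = 15;
        t.StartIndex = 23;
        t.StopIndex = 26;
        t.Text = "else";
        t.Type = 10;
        t.Line = 1;
        t.CharPositionInLine = 22;

        Console.WriteLine(t); // prints [@15,23:26='else',<10>,1:22]
    }
}
```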

Returning to the Lexer class… it reads the characters from an ICharStream, an interface that extends the IIntStream interface to read characters and add line and column information:

public interface IIntStream
{
    void Consume();
    int Index();
    int LA(int i);
    int Mark();
    void Release(int marker);
    void Rewind();
    void Rewind(int marker);
    void Seek(int index);
    int Size();
}


The IIntStream interface is a contract for a stream of integers with variable lookahead and marker points so the stream can save a return point with the Mark method and return to it using the Rewind method.
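The mark and rewind idea is easier to see on a tiny in-memory stream. The sketch below is an assumption of mine, not an ANTLR runtime class: SketchIntStream, its EOF constant and the way it stores markers in a list are all hypothetical, and it only covers a subset of the IIntStream members.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stream of ints; a teaching sketch, NOT an ANTLR class.
public class SketchIntStream
{
    public const int EOF = -1; // assumed end-of-input marker for this sketch

    private readonly int[] data;
    private int p = 0; // index of the next value to read
    private readonly List<int> markers = new List<int>(); // saved positions

    public SketchIntStream(int[] data) { this.data = data; }

    public int Index() { return p; }
    public int Size() { return data.Length; }

    // Moves past the current value (no-op at end of input).
    public void Consume() { if (p < data.Length) p++; }

    // LA(1) is the next unread value, LA(2) the one after it, and so on.
    // This sketch assumes i >= 1.
    public int LA(int i)
    {
        int pos = p + i - 1;
        return pos < data.Length ? data[pos] : EOF;
    }

    // Saves the current position and returns a marker for it.
    public int Mark()
    {
        markers.Add(p);
        return markers.Count - 1;
    }

    // Returns to the position saved under the given marker.
    public void Rewind(int marker) { p = markers[marker]; }
}

public static class Program
{
    public static void Main()
    {
        SketchIntStream s = new SketchIntStream(new int[] { 10, 20, 30, 40 });
        int m = s.Mark();            // remember index 0
        s.Consume();
        s.Consume();
        Console.WriteLine(s.LA(1));  // prints 30
        s.Rewind(m);                 // back to the marked position
        Console.WriteLine(s.LA(1));  // prints 10
    }
}
```

A recognizer can use this pattern to try one alternative speculatively, then rewind and try another if the first one fails.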

There are several classes implementing the ICharStream interface so the Lexer can get data from several sources:

• ANTLRStringStream: to provide data from a string.
• ANTLRFileStream: to provide data from a file.
• ANTLRReaderStream: to provide data from a TextReader.
• ANTLRInputStream: to provide data from a Stream.
 

So finally we’re ready to understand the Lexer class:

public abstract class Lexer : BaseRecognizer, ITokenSource
{
    public Lexer();
    public Lexer(ICharStream input);

    public virtual int CharIndex { get; }
    public virtual int CharPositionInLine { get; }
    public virtual ICharStream CharStream { set; }
    public override IIntStream Input { get; }
    public virtual int Line { get; }
    public virtual string Text { get; set; }

    public static void DisplayRecognitionError(string name, RecognitionException e);
    public virtual void Emit(IToken token);
    public virtual IToken Emit(int tokenType, int line, int charPosition, int channel, int start, int stop);
    public virtual void Match(int c);
    public virtual void Match(string s);
    public virtual void MatchAny();
    public virtual void MatchRange(int a, int b);
    public abstract void mTokens();
    public virtual IToken NextToken();
    public virtual void Recover(RecognitionException re);
    public override void ReportError(RecognitionException e);
    public void Skip();
}

 

The Lexer uses an ICharStream as its input and offers several properties that report the current character position in the underlying ICharStream.

The main operation accomplished by the Lexer class is to obtain tokens from the input stream when the NextToken method (of the implemented interface ITokenSource) is called.

When a Lexer is generated using ANTLR, a class that inherits Lexer is generated. That class will have a method called mXXX for each lexer rule where XXX is the rule name. Also a method called mTYYY (where YYY is an integer) will be generated for each token implicitly defined in a lexer rule like ‘>=’.

By default, each generated method checks that the expected characters are read from the input stream (using the Match, MatchAny and MatchRange methods) and then calls the Emit method to generate the associated token, which is stored in a field called token.

When the NextToken method is called, the Lexer class calls the abstract method mTokens, which is implemented in the class generated by ANTLR. mTokens selects one of the m* methods explained before, based on the input (using a DFA for this), and calls it. That method sets the token field and, if the token should not be skipped (to skip a token you have to call the Skip method of the Lexer class), NextToken returns it. Otherwise it repeats the process until a token is returned or the end of the input (EOF) is found.
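The control flow of NextToken can be sketched with a small hand-written lexer. Everything here is hypothetical and simplified: SketchLexer is not ANTLR-generated code, tokens are plain strings, a skipped token is represented by null, and the rule is chosen with a simple character test where the generated mTokens would use a DFA.

```csharp
using System;

// Hypothetical hand-written lexer that mimics the loop described in the text.
// NOT generated by ANTLR; it only illustrates the NextToken/mTokens/Skip flow.
public class SketchLexer
{
    private readonly string input;
    private int p = 0;      // index of the next character to read
    private string token;   // set by Emit, cleared by Skip

    public SketchLexer(string input) { this.input = input; }

    private void Emit(string text) { token = text; }
    private void Skip() { token = null; }

    // Rule ID: one or more letters; emits the matched text.
    private void mID()
    {
        int start = p;
        while (p < input.Length && char.IsLetter(input[p])) p++;
        Emit(input.Substring(start, p - start));
    }

    // Rule WS: one or more spaces; the token is skipped.
    private void mWS()
    {
        while (p < input.Length && input[p] == ' ') p++;
        Skip();
    }

    // Stand-in for the generated mTokens: picks a rule from the next character.
    private void mTokens()
    {
        if (input[p] == ' ') mWS();
        else if (char.IsLetter(input[p])) mID();
        else throw new InvalidOperationException("Unexpected character: " + input[p]);
    }

    // Returns the next non-skipped token text, or null at end of input.
    public string NextToken()
    {
        while (p < input.Length)
        {
            token = null;
            mTokens();
            if (token != null) return token; // not skipped: hand it to the caller
        }
        return null; // end of input
    }
}

public static class Program
{
    public static void Main()
    {
        SketchLexer lexer = new SketchLexer("if  x then");
        string t;
        while ((t = lexer.NextToken()) != null)
        {
            Console.WriteLine(t); // prints if, x and then on separate lines
        }
    }
}
```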

The Parser class inherits from BaseRecognizer and reads tokens from an object that implements ITokenStream. ITokenStream is just a specialization of the IIntStream to work with ITokens instead of ints:

public interface ITokenStream : IIntStream
{
    IToken Get(int i);
    ITokenSource GetTokenSource();
    IToken LT(int k);
    string ToString(int start, int stop);
    string ToString(IToken start, IToken stop);
}

 

ITokenStream uses an object that implements ITokenSource (remember that the Lexer class implements ITokenSource) to obtain tokens. There are two classes implementing ITokenStream: CommonTokenStream and TokenRewriteStream. The CommonTokenStream class buffers all the tokens and returns only the tokens on a specific channel. The TokenRewriteStream class has additional functionality to insert, update and delete tokens.
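The "buffer everything, return one channel" behaviour can be illustrated as follows. This is only a sketch under my own assumptions: ChannelToken and SketchChannelFilter are hypothetical names, and the channel numbers 0 and 99 are just values picked for the example; it does not show how CommonTokenStream is actually implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token carrying only what this sketch needs.
public class ChannelToken
{
    public string Text;
    public int Channel;

    public ChannelToken(string text, int channel)
    {
        Text = text;
        Channel = channel;
    }
}

// Hypothetical helper; NOT the CommonTokenStream implementation.
public static class SketchChannelFilter
{
    // The buffer keeps every token; only those on the requested channel are returned.
    public static List<string> OnChannel(List<ChannelToken> buffer, int channel)
    {
        List<string> result = new List<string>();
        foreach (ChannelToken t in buffer)
        {
            if (t.Channel == channel) result.Add(t.Text);
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed channel numbers for illustration: 0 for normal tokens, 99 for white space.
        List<ChannelToken> buffer = new List<ChannelToken>();
        buffer.Add(new ChannelToken("x", 0));
        buffer.Add(new ChannelToken(" ", 99));
        buffer.Add(new ChannelToken("=", 0));
        buffer.Add(new ChannelToken(" ", 99));
        buffer.Add(new ChannelToken("1", 0));

        // The parser-facing view contains only the channel 0 tokens.
        Console.WriteLine(string.Join("", SketchChannelFilter.OnChannel(buffer, 0).ToArray())); // prints x=1
    }
}
```

Because the white space stays in the buffer, it is still available to tools that need the original text even though the parser never sees it.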

The Parser class is very simple:

public class Parser : BaseRecognizer
{
    public Parser(ITokenStream input);

    public override IIntStream Input { get; }
    public virtual ITokenStream TokenStream { get; set; }
}

 

It only adds one property, TokenStream, which returns the ITokenStream that feeds the parser.

When a Parser is generated using ANTLR, a class that inherits from Parser is generated. For each parser rule, that class will have a method with the same name as the rule. The method signature will depend on the attributes used (if any) in the rule and if abstract syntax tree (AST) construction is enabled or not.

Tree parsers will be explained in another post, as there is a lot of tree-related theory to introduce.

Tuesday, December 26, 2006 4:10:19 PM (Romance Standard Time, UTC+01:00) | ANTLR | Microsoft .NET Framework