Building a Code Formatter with ANTLR4 and Python

July 26, 2020

I’m a big fan of Age of Empires II (commonly abbreviated to aoe2, or age2) - a hugely popular RTS game first released in 1999. Thankfully for fans of the game, we’ve been blessed with many revivals of the game, including two official remasters (HD Edition in 2013, and Definitve Edition in 2019), as well as a set of unofficial patches for the original game known as UserPatch. With the release of Definitive Edition, aoe2 has seen a huge revival thanks to a number of quality-of-life improvements (though not welcomed by all), and a large splash of extra eye-candy.

To help practice the game while playing with friends, I’ve been implementing a custom AI for the game, affectionately named “Billy Build Order”, that provides advice while you play an ordinary game. As I added features to the AI, it quickly became apparent I could save a lot of time by generating the AI from a script - there was a large amount of repetition required to work within the limitations of the AI engine. There is one drawback to my AI generation code: the output can be quite messy, particularly where whitespace is concerned.

I want to release the AI as a download on the in-game mod manager in future, and I want people to be able to read and edit the output AI, without needing to be able to run and understand my generation script. More easily readable output will also make debugging of the generated code much easier. Therefore, I needed to implement an automatic code formatter or “pretty printer”¹ - and I thought I’d share how I went about it, in case anyone finds themselves wanting to do similar in future.

How to Format Code

Like every programming project, automatically formatting code can be tackled in a few different ways. You could choose to build from only a simple lexer - a piece of software designed to tokenize input text. This simply groups together input characters into tokens, which are atomic elements of code. Examples of tokens in a regular procedural programming language could include:

identifiers (i.e. variable or function names);
operators (e.g. == or +);
keywords (e.g. return or while);
or literals (e.g. "hello world" or 123.45).

Another option is to fully parse the input text. This involves taking the stream of tokens returned by the lexer and then building a parse tree according to the rules of the language’s grammar. For example, the grammar might dictate that a binary operator must always be surrounded by identifiers or literals, e.g.:

# <IDENTIFIER> <OPERATOR> <LITERAL> - valid
total + 1

# <IDENTIFIER> <OPERATOR> <KEYWORD> - invalid
total + for

If the token stream returned by the lexer doesn’t match the rules of the grammar, then the parser will throw an error and will not be able to continue.

Using a parser has two main benefits:

An AST is easily traversable. The tree data structure returned by the parser handles nesting already, whereas the lexer output is just an array of tokens. Getting a list of all statements in a block, for example, is a straightforward case of looping through a pre-built list of child nodes with a parser. With only a lexer, this would require stepping through the token array and correctly grouping tokens, stopping at the end of block token.
Formatting based on an AST can make use of contextual information. It might be desired to format a given substring differently depending on the context it’s used in.

Using only a lexer has one main benefit:

The formatter can work on invalid code, whereas the parser will fail unless the input is valid. This is particularly useful if you want to be able to format incomplete code.

For my use case, tha ability to work with invalid could instead be considered a drawback, as otherwise my only tool for checking the validity of generated AI code is to load it into the game - not the most efficient build tool in the world! Losing the ability to format invalid or incomplete code is unfortunate, but not really a vital feature.

Therefore, I decided on the parser route. In the rest of this article, I’ll go through the process of building a parser-based code formatter.

Parsing AoE2 AI Code with ANTLR4 and Python

An AoE2 AI is made up of (at least) two files. An empty .ai file that signifies the entry point into the AI, and a .per file that contains the rules². The documentation included with the game never defines what PER stands for (answers on a postcard please), so you’ll have to forgive me never introducing the acronym properly.

I decided on using ANTLR to help build my formatter. ANTLR is a popular parser generator written in Java, with support for many different target languages including Python. This means that building a lexer and parser only requires the developer to define the rules of the grammar in ANTLR syntax, and the tool can then automatically generate code in the chosen target language. If someone later wants to re-use the grammar rules in a different programming language, it’s very easy to do.

Creating the PER Grammar for ANTLR4

Included with the game is a word document (!) written for the original game, detailing how to write AI files. This document, the Computer Player Strategy Builder Guide: AI Expert Documentation, gives a complete reference for AI implementation, but it doesn’t provide a formal description of the grammar. When writing the ANTLR grammar, that meant piecing together different sections of the guide, and making a few assumptions along the way.

When writing an ANTLR grammar, you need to choose where to start, either by starting defining the tokens for the lexer, or defining the parser rules. I found that starting from the top, writing parser rules first and then adding tokens as necessary was the easiest approach to conceptualise.

A rough guide to writing a grammar without a formal spec:

Write a simple test case for a subset of the language.
Write a parser rule (or rules) covering the test case.
Add the lexer rules as required by the parser rules added.
Use grun to run the rules against the test case, and debug any issues.
Repeat the above steps, incrementally expanding the language support of the grammar.

While writing your lexer and parser rules, there are some tips to keep in mind:

Lexer rules must start with an uppercase letter, but it is conventional to use only uppercase (and the opposite is true for parser rules).
Remember that the lexer runs before the parser. This means that the input won’t be parsed based on the context given by parser rules.
Where input text could match multiple rules, the following precedence is used:
1. The longest match takes priority
2. Literal matches take priority over regex matches
3. If neither of the above apply, then the rule defined first in the grammar takes precedence.
Trying to be too specific with tokens can be counter-productive. If you’re having issues with the lexer not choosing the desired token, and re-ordering rules isn’t helping, think about whether you actually need your parser to differentiate between the token types!

Testing and Debugging with grun

Before using grun, it’s necessary to generate and compile the Java version of the parser. Note that this will produce a lot of output, so if you’re doing the same at home you’ll want to sort out your gitignore config sooner rather than later.

$ antlr4 MyGrammer.g4
$ javac *.java

grun has a number of different modes for debugging. The best tool for a first test is tree mode, which will run the parser against the rule and print the resulting parse tree.

$ grun GRAMMAR_NAME rule_name input_path -tree

As an example, running this on the first version of my PER grammar gives the following output:

$ grun PER per -tree testinput
(per (statement ( (command (defrule defrule \n         (proposition_list (proposition ( player-in-game   1 )) \n         (proposition ( player-human   1 )) \n         (proposition ( goal   1   1 )) \n        ) => \n         (action_list (action ( chat-to-player   1   "3 Train villagers, build 2 houses, and send 6 to sheep [7 pop](6 sheep)" )) \n         (action ( set-goal   1   2 )) \n))) )) \n <EOF>)

The output to the console can be somewhat difficult to understand. Each parentheses followed by a rule name in the output shows a new node in the tree, with child nodes and literal input nested inside (note that parentheses in the input are not escaped in any way, which can be confusing depending on your input). If you expand out the example output manually and adjust the whitespace, the tree structure becomes much more apparent:

(per
  (statement (
    (command
      (defrule defrule \n
        (proposition_list
          (proposition ( player-in-game   1 )) \n
          (proposition ( player-human   1 )) \n
          (proposition ( goal   1   1 )) \n
        )       => \n
        (action_list
          (action ( chat-to-player   1   "3 Train villagers, build 2 houses, and send 6 to sheep [7 pop](6 sheep)" )) \n
          (action ( set-goal   1   2 )) \n
        )
      )
    ) )
  ) \n <EOF>
)

If you’re running in a headful environment, you can also use -gui for the same information in an actual graph format, which can be much easier to read. If there are any errors in either your test input or grammar that mean the input can’t be parsed, you’ll see some debugging information in the output that will detail what input ANTLR was expecting versus what was next in the input.

To help work through these errors, either -tokens or -trace modes can be useful for unpicking how ANTLR interpreted your input file. Tokens mode will simply print the list of all tokens as returned by the lexer, whereas trace mode will output the parser rules matched and the tokens for each. Remember that every time you edit the grammar, you will need to regenerate the java sources with antlr and recompile those sources in order for grun to pick up your changes.

Turning an ANTLR Grammar into a Parser

Eventually, you should end up with a grammar that handles everything you want it to - or at least a small enough subset that you can start to format something useful. At this point, unless you want to build your formatter in Java, you’ll need to now generate the parser in your desired target language. In my case, Python:

$ antlr4 -Dlanguage=Python3 PER.g4

This will generate the following files:

PER.interp
PER.tokens
PERLexer.interp
PERLexer.py
PERLexer.tokens
PERListener.py
PERParser.py

The .interp and .tokens files aren’t needed - these are only useful for advanced use cases. The three .py files contain the parser code that you will need to build a formatter.

PERLexer.py: This provides the Lexer subclass for our grammar. It contains a serialised state machine for tokenising input.
PERParser.py: This provides the Parser subclass for our grammar. It once more contains a serialised state machine that acts upon the tokens produced by the lexer. It also contains a number of context classes, one for each parser rule in our grammar. These context classes allow access to all of the tokens and nested contexts found within.
PERListener.py: This class provides us method stubs to execute code whilst automatically walking depth-first through the parse tree.

The generated listener is perfect for a simple code formatter - ANTLR can automatically handle walking through the parse tree depth-first, i.e. in the order that the code appears in the input. Before we can start formatting code, we first need to plug together the generated classes with some supporting library functions provided by antlr4-python3-runtime:

#!/usr/bin/env python3

import sys

from antlr4 import FileStream, CommonTokenStream, ParseTreeWalker

from PERLexer import PERLexer
from PERParser import PERParser
from PERListener import PERListener

def main(argv):
    inp = FileStream(argv[1])
    lexer = PERLexer(inp)
    stream = CommonTokenStream(lexer)
    parser = PERParser(stream)
    # Parse the input starting at the "per" rule.
    tree = parser.per()

    listener = PERListener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)

if __name__ == "__main__":
    main(sys.argv)

Run the above script with your test input file as the first command line argument… and nothing will happen. If so, congratulations!

The above script uses antlr4’s FileStream³ class to read input from the file path passed on the command line into the generated PERLexer. The CommonTokenStream then wraps the lexer, providing an interface supporting the PERParser.

Finally, we call the method on the parser with the same name as our top level rule, per(). Our generated parser will then parse the input file, returning a parse tree rooted at a per node.

At this point, we could manually manipulate or interact with the parse tree if necessary. In this example, we simply instantiate the generated PERListener class, and pass it as the listener to invoke for each node as we use the default depth-first ParseTreeWalker to walk the tree. Our PERListener class is simply full of method stubs, so the walker walks the tree, but no actions are taken in response, so the script will silently exit.

Formatting Code with a Listener

Finally, we’re ready to start producing formatted code. First off, create a subclass of the generated PERListener, and give the constructor somewhere to write output to, alongside some helper methods for writing:

from PERParser import PERParser
from PERListener import PERListener

class FormattedPERListener(PERListener):
    def __init__(self, out_stream, indent='    '):
        self.__out = out_stream
        self.__indent = indent
        self.__indent_level = 0

    def __write(self, s):
        self.__out.write(s)

    def __line(self, s):
        self.__write('\n')
        pfix = self.__indent * self.__indent_level
        self.__write(pfix)
        self.__write(s)

You should then replace the use of PERListener in your formatter script with this new FormattedPERListener class.

Next, start overriding stub methods from the base class to start writing output:

    # A defrule is the main building block of an AoE2 AI
    # AoE2 AIs are not programmed procedurally, but the closest analogue
    # is of a function definition with a built-in if condition.
    def enterDefrule(self, ctx:PERParser.DefruleContext):
        entertxt = '(' + ctx.DEFRULE().getText()
        self.__line(entertxt)
        # Increase the indentation level as we are now inside a new block
        self.__indent_level += 1

    # An action is a single statement of which many can appear in each defrule.
    # Actions cause the AI to do something, and can have 0 or more arguments.
    def enterAction(self, ctx:PERParser.ActionContext):
        self.__line(ctx.getText())

Some pointers for writing your own listener methods:

Each context class contains a getText() method, which will simply return the text corresponding to the context, verbatim. If you don’t care about reformatting the content of a given context, you can simply write getText()'s return value to the output, which is useful if you want to incrementally work on formatting.
Each context class contains convenience methods for accessing the child nodes by rule or token name. Note that the names will be generated based upon the original grammar file - so by convention, the method to get a token will be in all caps.
Token text can be accessed using the same getText() method as with parser rules.
If a parser rule contains a repeated child, the convenience access method will return a list, or can be provided a numeric index to get only a single element.
Similarly, if a given child is optional, the convenience access method will return None if it did not exist in the input.
The base ParserRuleContext class contains other generic methods for getting access to content, such as getChildCount() and getChildren(). Generally, using the named methods is more convenient, but you may find these useful in some situations.
There are lots of different ways to express the same language with different lexer and parser rules. When writing my listener, I found that some of the rules I had written needed to be refactored to make the formatter listener implementation cleaner.

With these pointers in mind, the above example should be straightforward to understand:

When the walker enters a defrule node, it outputs a literal ( followed by the text of DEFRULE (which always corresponds to the string defrule).
- I have taken a shortcut here - ( is actually the OPEN token’s only possible text value, but OPEN only appears in the parent context of defrule, so I can’t use an OPEN() method.
- After outputting the start of the defrule, the indentation level is incremented.
When the walker enters an action node, it outputs the text of that action without reformatting it on a new line with an indent.
- The indent level will depend on the current value of the __indent_level counter.
- The indent will only apply to the first line of the action. If a newline character appears in the input text of the action, then it will be echoed verbatim and will break the formatting.

Note that the example is very much incomplete - I’d recommend having a look at the full implementation on GitHub for the full source code. Running the formatter using only these 2 methods will produce invalid PER output, as certain parts of the language (e.g. the propositions) are completely ignored!

Keeping Track of State in a Listener

For the basic implementation, using the listener pattern with ANTLR makes it incredibly quick and easy to get started. However, once our implementation starts needing to make use of contextual knowledge about neighbouring or child nodes in the parse tree, we need to start tracking these things manually.

For example, in my formatter, I wanted to only insert blank lines between certain pairs of statements. A blank line between two rule definitions is always desired (similar to how you would always leave a blank line between function definitions in a procedural language). However, if a comment appears between those two rules, we don’t want any blank lines between the comment and the rule, like so:

(defrule
    (true)
    =>
    (set-goal 1 1)
    (disable-self)
)

; some comment
(defrule
    (player-human 1)
    =>
    (chat-to-player 1 "Hello, I am the Aztecs!")
    (disable-self)
)

To determine whether we need to insert a blank line or not, we need to know what type of statement the previous node was (at the same level in the tree). The listener pattern doesn’t give us a straightforward way of keeping track of this - there are two options:

Grab the parent node using ctx.parent(), and from there get the list of children - but there’s no readily available index of the current node inside the parent, so it’s not as simple as checking child i - 1 of the parent node. This is a little messy, and means you end up writing some manual tree traversal on top of the walker.
Take note of the previous statement type manually by adding a property to your listener, and updating it in the relevant enter/exit states. Depending on the structure of the grammar, this might require code inside multiple rule enter/exit functions.

Either way, we need to do some manual work to operate based on the sibling nodes. In the end, I decided to go with the latter approach, as I thought it was more elegant to not traverse back up to the parent and then back down again - but your mileage may vary.

Recursive Rules in a Listener with State

Manually keeping track of state, like we chose to do for recording the previous statement type, gets harder once we start to implement support for recursion in parser rules.

Another example in the PER language that’s relevant are the boolean operators, and and or. All facts in a rule are implicitly conjoined with logical AND, but the explicit boolean logic operators are provided for more complex conditions. For example:

(defrule
    (or
        (players-population 1 == 33)
        (or
            (players-population 1 == 23)
            (players-population 1 == 13)
        )
    )
    =>
    (some-action)
)

This example shows how facts can be nested, and how the boolean operator are themselves facts. Depending on whether the facts within contain facts themselves, we want to switch between the default fact-on-single-line mode, and an expanded mode where we split the arguments to each fact over multiple lines.

To support this, when recursing in the listener, we need to save the current state, and start with a fresh state that can be used as normal, restoring the original state when we step back up into the original rule context.

There’s one obvious answer here - a stack. We can manually manage a stack for each piece of state that requires it, and push a fresh bit of state onto the stack every time we recurse, and pop one level of state off the stack every time we exit from a level of recursion.

This is fine, and works well - but it again forces us into managing the state manually, where in this case it would more naturally fit a recursive function, and the language would then handle saving, restoring, and creating state for us.

Wrap-Up

That about wraps up some of the things I learnt while building the first versions of the code formatter with ANTLR. Overall, I’m quite happy with the design, but I am thankful that PER is a fairly straightforward format - although as it turns out, blogging about a project is a very good method of rubber duck debugging, so there may be some more improvements on the way.

In any case, if you’re interested in looking at the complete working example, please check out the project on GitHub. The full commit history is available, if you’d like to step back in time to when this article written (or perhaps earlier!).

Otherwise, if you’re here because you want to format AoE2 AI code yourself, you can either grab the latest code from GitHub, or use the version available on PyPi. Full instructions on how to use the tool are available there!

This isn’t entirely accurate - I could have simply worked on the code generation to make sure that it always generated neatly formatted code. Having the formatter as a standalone tool/library means that it’s useful for other people generating AI code, or it could even be useful for someone working on AI rules by hand. Another benefit here is that the formatting code can be entirely decoupled from the generation code. I can fully rewrite my generation code if it no longer meets my needs, and I won’t be restricted by the structure of the formatter. It’s somewhat similar to the separation between a model and a view in an MVC application. ↩︎
I plan on writing up how I made the Billy Build Order AI, and a brief description of how AIs work in AoE2, in a future post. Check back later! ↩︎
Watch out - ANTLR’s Python stream classes are not compatible with streams as used by the Python standard library. If you want to parse input from STDIN, then use antlr’s StdinStream class. If you instead want to parse input from a string, then simply pass that string to the constructor of antlr’s InputStream class. ↩︎