Build a Tiny Compiler in Java

Codekrypt Compiler

re you googling the questions “How to create a compiler in Java?”, “Tiny compiler in Java?”, “AST to Java bytecode”. Then you are in the right place. The word Tiny is subjective. But yeah, the code is simple enough to understand the end-to-end flow of compiler development.

Talk is cheap. Show me the code!

There you go: Codekrypt Compiler Github

Pre-Requisite:

  1. Visitor Pattern
  2. Java ASM

Compiler Phases:

  1. Lexical Analysis [String → Token ]
  2. Syntactic Analysis (ie Parsing) [ Token → AST ]
  3. Semantic Analysis [Validating AST]
  4. Optimization (Optional)
  5. Code Generation [AST → Java bytecode]

Grammar:

To keep it dead simple, we will be using the below grammar.

  • Our Program will have multiple Statement.
  • A Statement is either Let or Show.
  • Let is of the form VAR = INT.
  • Show is of the form SHOW INT or SHOW VAR .
  • VAR → Variable ( String with lower or uppercase letters)
  • INT → ( Integer, ie Positive Numberwithout decimal)

This grammar was taken from another article on ANTRL4.

Let's start by implementing the Compiler.

1. Lexical Analysis (Tokenizer)

nextToken() → We iterate through each character and see if it can be converted to a Token.

getCurrentToken() → We fetch the current token which was set, after invoking nextToken().

Our Token will hold the type & value.

2. Parser + AST

Simply put, the parser is all about populating this below skeleton class using Lexer.

The important point to note is ProgramContext, StatementContext, LetContext, ShowContext, TerminalNode are of children of Visitable and ParseTree.

Why?

Visitable → To accept custom Visitor for adding business logic on each node.

ParseTree → For traversing the child nodes and propagating Visitor.

AST

We create a base class (ParserRuleContext)with concretion on common methods, that would be extended by their child classes.

The Grammer elements (ie AST Nodes) extend from this base class.

Grammar Elements

Now the child nodes (ie implementations) of AST will only have the relevant variables and functions. Let's see an example of LetContext.

LetContext node is only responsible for handling variableName & variableValue.

TerminalNodehas a slightly different logic since they don’t have children.

3. Visitor & Semantics

Yes! Visitor, it is. We will be using Visitor in the next 3 stages.

Syntax validates “Arjun 1234 good”. (Is it a valid sentance?)

Semantics validates “Arjun The good is”. (Is it meaningful?)

In our case, semantically speaking, we need to validate if the variable is declared (VAR1 = 10) before it is referenced in SHOW VAR1.

This validation logic can be implemented using the SemanticAnalyzer.

4. Visitor & Interpreter.

The interpreter runs your code line by line.

We will implement a visitor that handles LetContext& ShowContextto print the output of our code.

5. Visitor & Code Generation (Java ASM)

This part will be a bit difficult, but if you have an idea of using Java ASM, then it will be really simple. In this stage, we will be converting our AST to Java bytecode.

The easiest way to generate the ASM code for your class is using ASMifier.

In visitProgram() we open the ClassWriter and main MethodVistor.

Once the ClassWriter and main Method Visitor are opened, we invoke super() to propagate the call to child nodes, ie VisitLet() and VisitShow(), before closing these writers.

In the visitLet() we use Java Ops Codes:

  • BIPUSH (Byte Push)
  • ASTORE + VariableIndex to store reference into a local variable.

Note 1: VariableIndex starts from 1 since index 0 is reserved for args[] in : main(String[] var0) .

Note 2: We save VariableIndex for further referencing in visitShow().

In visitShow(), we use Java Ops code:

  • ALOAD + VariableIndex: to load reference from the local variable (VAR)
  • BIPUSH (Byte Push): If it is an integer constant. (INT)
  • INVOKEVIRTUAL: to invoke System.out.println()

6. Chaining and compiling.

To read Java code from the .class file generated, use IntelliJ’s decompiler.

Decompiled class file

Conclusion

You made it to the end. Cheers🍻!

Java is verbose and it is really difficult to explain the code, line by line. But I have tried my best to give a high-level overview of the Compiler Development process.

Please do check out the complete code in Github. Do star the project, since I may be updating this with a more mature Grammer going forward. The parent project contains examples of the parser-libraries that I have tried out. Hope this article helps someone!

Found it Interesting?
Please show your support by 👏.

Writes on Big Data, AWS & Backend technologies.