Are you googling the questions “How to create a compiler in Java?”, “Tiny compiler in Java?”, “AST to Java bytecode”. Then you are in the right place. The word Tiny is subjective. But yeah, the code is simple enough to understand the end-to-end flow of compiler development.
Talk is cheap. Show me the code!
There you go: Codekrypt Compiler Github
- Lexical Analysis [String → Token ]
- Syntactic Analysis (ie Parsing) [ Token → AST ]
- Semantic Analysis [Validating AST]
- Optimization (Optional)
- Code Generation [AST → Java bytecode]
To keep it dead simple, we will be using the below grammar.
Programwill have multiple
- Let is of the form
VAR = INT.
- Show is of the form
- VAR → Variable ( String with lower or uppercase letters)
- INT → ( Integer, ie
Positive Numberwithout decimal)
This grammar was taken from another article on ANTRL4.
Let's start by implementing the Compiler.
1. Lexical Analysis (Tokenizer)
nextToken() → We iterate through each character and see if it can be converted to a Token.
getCurrentToken() → We fetch the current token which was set, after invoking
Our Token will hold the
2. Parser + AST
Simply put, the parser is all about populating this below skeleton class using
The important point to note is
TerminalNode are of children of Visitable and ParseTree.
Visitable → To accept custom Visitor for adding business logic on each node.
ParseTree → For traversing the child nodes and propagating Visitor.
We create a base class (
ParserRuleContext)with concretion on common methods, that would be extended by their child classes.
The Grammer elements (ie AST Nodes) extend from this base class.
Now the child nodes (ie implementations) of AST will only have the relevant variables and functions. Let's see an example of
LetContext node is only responsible for handling
TerminalNodehas a slightly different logic since they don’t have children.
3. Visitor & Semantics
Yes! Visitor, it is. We will be using Visitor in the next 3 stages.
Syntax validates “Arjun 1234 good”. (Is it a valid sentance?)
Semantics validates “Arjun The good is”. (Is it meaningful?)
In our case, semantically speaking, we need to validate if the variable is declared (
VAR1 = 10) before it is referenced in
This validation logic can be implemented using the SemanticAnalyzer.
4. Visitor & Interpreter.
The interpreter runs your code line by line.
We will implement a visitor that handles
ShowContextto print the output of our code.
5. Visitor & Code Generation (Java ASM)
This part will be a bit difficult, but if you have an idea of using Java ASM, then it will be really simple. In this stage, we will be converting our AST to Java bytecode.
The easiest way to generate the ASM code for your class is using ASMifier.
visitProgram() we open the
main Method Visitor are opened, we invoke
super() to propagate the call to child nodes, ie
VisitShow(), before closing these writers.
visitLet() we use Java Ops Codes:
VariableIndexto store reference into a local variable.
Note 1: VariableIndex starts from 1 since index 0 is reserved for args in :
main(String var0) .
Note 2: We save
VariableIndex for further referencing in
visitShow(), we use Java Ops code:
VariableIndex: to load reference from the local variable (VAR)
BIPUSH(Byte Push): If it is an integer constant. (INT)
INVOKEVIRTUAL: to invoke
6. Chaining and compiling.
Java code from the
.class file generated, use IntelliJ’s decompiler.
You made it to the end. Cheers🍻!
Java is verbose and it is really difficult to explain the code, line by line. But I have tried my best to give a high-level overview of the Compiler Development process.
Please do check out the complete code in Github. Do star the project, since I may be updating this with a more mature Grammer going forward. The parent project contains examples of the
parser-libraries that I have tried out. Hope this article helps someone!
Found it Interesting?
Please show your support by 👏.