Jul 18, 2024

Last updated on Jul 18, 2024

Writing an Interpreter by Hand

Since I started writing code, I have always wondered how interpreters and compilers work, even though I took a compiler design class in college. However, I had never practically worked on anything related to it.

A few days ago, the thought of writing my own language hit me. The next thing I did was open my laptop and write a simple parser to parse 1+2;.

This was going to be a recursive descent parser.

Tokenizer/Lexer

First, I wrote the tokenizer (or lexer), which breaks up the expression into parts. When breaking things into entities, we need a structured way to hold them. After a bit of Googling, I realized that before parsing, we need an Abstract Syntax Tree (AST).

The challenging part of this step was arranging the tokens based on priority. For example, * has higher precedence than +, so we must visit * before +. This can be achieved using recursion (hence the name “recursive descent parser”).

func (a *AstBuilder) Expr() types.Node {
    if a.CurrentTokenPointer >= len(a.tokens) {
        return nil
    }
    left := a.Term()
    switch a.getCurrentToken().TokenType {
    case types.PLUS, types.MINUS, types.EQUALS, types.GREATER, ...:
        for a.getCurrentToken().TokenType == types.PLUS || ... {
            op := a.getCurrentToken().TokenType
            a.eat(op)
            right := a.Term()
            left = types.BinOP{Left: left, Right: right, Op: op}
        }
    }
    return left
}

This function is the first to be called and immediately calls a.Term() to ensure we visit terms before handling + or - operations.

func (a *AstBuilder) Term() types.Node {
    left := a.Factor()
    for a.getCurrentToken().TokenType == types.MULTIPLY || ... {
        op := a.getCurrentToken().TokenType
        a.eat(op)
        right := a.Factor()
        left = types.BinOP{Left: left, Right: right, Op: op}
    }
    return left
}

Similarly, a.Factor() is called within Term() to ensure factors are prioritized over multiplication or division.

func (a *AstBuilder) Factor() types.Node {
    c := a.CurrentTokenPointer
    switch a.tokens[c].TokenType {
    case types.NOT:
        a.eat(types.NOT)
        if a.peek(0).Value.(types.Literal).Type != types.BOOLEAN {
            log.Fatalf("line %d: expected boolean\n", ...)
        }
        return types.UnaryOP{Left: types.NOT, Right: a.Expr()}
    case types.EQUALS, ...:
        a.eat(a.getCurrentToken().TokenType)
        right := a.Expr()
        return right
    }
    return nil
}

This is the endpoint for the recursion. These operations take precedence over Expr and Term cases.

Abstract Syntax Tree (AST)

An AST is a tree representation of the code, where each node represents an entity such as ExpressionNode, StatementNode, FunctionNode, or LiteralNode.

Root Node Structure

type Compound struct {
    Children []Node
}

The root node is a Compound node, meaning it contains multiple child nodes, which can be expressions or statements.

Expression vs. Statement

Expression: Returns a value.
Statement: Does not return a value.

For example, the AST for 1+2; looks like this:

Compound{Children: []Node{
    Literal{Value: 1},
    Op{Value: '+'},
    Literal{Value: 2},
    Literal{Value: ';'}
}}

This represents a binary operation since there are two nodes surrounding the operator.

Evaluating the AST

Evaluation involves visiting the tree and computing the values accordingly. For 1+2;, when encountering +, we already have the left operand and then visit the right operand:

left = 1, right = 2, op = +

We then compute 1 + 2 and return the result.

Variables

Next, I needed to store variables. First, we need to parse assignment statements. I chose the syntax:

a <- b;

It might look unusual, but I liked how the arrow visually represents assignment.

type Assign struct {
    Id    Node
    Value Node
    Type  ASSIGN_TYPE
}

Handling assignment in AST:

left := a.Term()
case types.ASSIGN, types.LOCAL:
    assignType := types.GLOBAL_ASSIGN
    if a.getCurrentToken().TokenType == types.LOCAL {
        a.eat(types.LOCAL)
        assignType = types.LOCAL_ASSIGN
        left = a.Term()
    }

    a.eat(types.ASSIGN)
    right := a.Expr()
    a.eat(types.SEMICOLON)
    return types.Assign {
        Id: left,
        Value: right,
        Type: assignType
    }

Variables are stored in a symbol table:

case types.Assign:
    right := e.Visit(n.Value)
    if n.Type == types.GLOBAL_ASSIGN {
        e.SymbolTable[n.Id.(types.Id).Name] = right
    } else {
        e.LocalSymbolTable[n.Id.(types.Id).Name] = right
    }
    return right

Control Flow

Lark supports simple control flow with if and else:

type IfElseStatement struct {
    Condition    Node
    IfChildren   []Node
    ElseChildren []Node
}

Example:

if (1+2 == 3) <<
    value <- "ok";
>> else <<
    value <- "not ok";
>>

Arrays

Lark supports heterogeneous arrays:

arr <- [1,2];
sum <- arr@0 + arr@1;

type Array struct {
    Name  string
    Value interface{}
    Index int
}

Functions

Functions follow a structured pattern:

type Function struct {
    Name             string
    Arguments        []Node
    Children         []Node
    ReturnExpression Node
    Variables        []string
}

type FunctionCall struct {
    Name      string
    Arguments []Node
}

Example function definition:

fn add[x, y] <<
    return x+y;
>>

Conclusion

This is Lark, a small language I created while exploring interpreters. It is still a work in progress, and I plan to add more features. You can check out the project here.

#others