Busy Developer's Guide to Building A Programming Language

ted@tedneward.com | Blog: http://blogs.tedneward.com | Twitter: tedneward | Github: tedneward | LinkedIn: tedneward

Objectives

What are we looking to do here?

Building a language?

Why on earth would anyone bother?

Motivation

Why bother learning how compilers are built?

Problem

Consider a simple state machine

Problem

Consider the traditional "business rules"

Problem

Consider CS problems we haven't solved yet

Trends

Within the last few years, "Domain-Specific Languages" have become a new area of inquiry

Trends

Lots of tools are already "languages", "DSLs", or related to languages in some way

Language Design

Creating a custom programming language

Language Design

Creating a custom programming language/DSL

Lexing and Parsing

Lexing and Parsing

Intermediate Representation

Intermediate Representation

Analysis

Analysis

Execution

Execution

Code Generation

Code Generation

Lexing

From letters to words

Lexing

Lexing is a first-pass verification of input stream

Lexing

Lexing takes characters and produces "tokens"

Lexing

Lexing examples (Java)

Parsing

From words to sentences with meaning

Parsing

Parsing takes input format and transforms it into an intermediate representation

Parsing

Examples using English

Parsing

Much of the time, an input format is described by a formal grammar

Parsing

4-operation calculator grammar (EBNF)


input	::= ws expr ws;

expr	::= ws factor [{ws ('*'|'/') ws factor}];
factor	::= ws term [{ws ('+'|'-') ws term}];
term	::= '(' ws expr ws ')' | '-' ws expr | number;

number	::= {dgt} ['.' {dgt}] [('e'|'E') ['-'] {dgt}];
dgt	::= '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
ws	::= [{' '|'\t'|'\n'|'\r'}];
  

Parsing

Sometimes the grammar is fed as input to a code-generator tool to generate the code for the lexer and parser

Parsing

Sometimes the parsing can be written as a library within a language, fed by Strings read in by the host language

Parsing

Sometimes we can duck the parser altogether and use another programming language to be the parsed input format for us

Parsing

Parser implementation choices

Intermediate Representation

An optimized form of the language input

Abstract Syntax Tree

Abstract Syntax Tree (AST)

Intermediate Representation

Intermediate Representation (IR) is an optimized format of the input

Analysis

Transformations on the IR

Analysis

Once the input form is in an IR, optimizations can be made to improve the efficiency of the code on a variety of different levels

Execution

Directly executing the intermediate rep

Execution

IR structures can be executed/interpreted by a host environment

Interpreter

Interpreter

Code Generation

To make it faster/standalone

Code generation

Produce code output for execution

Code generation

Code generation implementation tools

Example: Calc

A calculator 'language'

Example: Calc

Example: A calculator language (Calc)

Example: Calc

Example calculator script

193
a = 5
b = 6
a+b*2
(1+2)*3

Example results

193
17
9

Example: Calc

Lexing

Example: Calc

Labeled Expressions grammar

MUL :   '*' ; // assigns token name to '*' used in grammar
DIV :   '/' ;
ADD :   '+' ;
SUB :   '-' ;
MOD :   '%' ;
ID  :   [a-zA-Z]+ ;      // match identifiers
INT :   [0-9]+ ;         // match integers
NEWLINE:'\r'? '\n' ;     // return newlines to parser (is end-statement signal)
WS  :   [ \t]+ -> skip ; // toss out whitespace

Example: Calc

Parsing

Example: Calc

Labeled Expressions grammar

grammar LabeledExpr;
    
prog:   stat+ ;
    
stat:   expr NEWLINE                # printExpr
    |   ID '=' expr NEWLINE         # assign
    |   NEWLINE                     # blank
    ;
    
expr:   expr op=('*'|'/'|'%') expr      # MulDiv
    |   expr op=('+'|'-') expr      # AddSub
    |   INT                         # int
    |   ID                          # id
    |   '(' expr ')'                # parens
    ;

Example: Calc

AST

Example: Calc

Interpretation

Example: Calc

EvalVisitor -- an interpreter

public class EvalVisitor extends LabeledExprBaseVisitor<Integer> {
    /** "memory" for our calculator; variable/value pairs go here */
    Map<String, Integer> memory = new HashMap<String, Integer>();

Example: Calc

EvalVisitor -- assignment

    /** ID '=' expr NEWLINE */
    @Override
    public Integer visitAssign(LabeledExprParser.AssignContext ctx) {
        String id = ctx.ID().getText();  // id is left-hand side of '='
        int value = visit(ctx.expr());   // compute expression
        memory.put(id, value);           // store it in our memory
        return value;
    }

Example: Calc

EvalVisitor -- printing expressions

    /** expr NEWLINE */
    @Override
    public Integer visitPrintExpr(LabeledExprParser.PrintExprContext ctx) {
        Integer value = visit(ctx.expr()); // evaluate the expr child
        System.out.println(value);         // print the result
        return 0;                          // return dummy value
    }

Example: Calc

EvalVisitor -- constants and identifiers

    /** INT */
    @Override
    public Integer visitInt(LabeledExprParser.IntContext ctx) {
        return Integer.valueOf(ctx.INT().getText());
    }
    
    /** ID */
    @Override
    public Integer visitId(LabeledExprParser.IdContext ctx) {
        String id = ctx.ID().getText();
        if ( memory.containsKey(id) ) return memory.get(id);
        return 0;
    }

Example: Calc

EvalVisitor -- mulDiv and addSub nodes

    /** expr op=('*'|'/'|'%') expr */
    @Override
    public Integer visitMulDiv(LabeledExprParser.MulDivContext ctx) {
        int l = visit(ctx.expr(0)); int r = visit(ctx.expr(1));
        if (ctx.op.getType() == LabeledExprParser.MUL) 
            return l * r;
        else  if (ctx.op.getType() == LabeledExprParser.DIV) 
            return l / r;
        return l % r; // must be MOD
    }
    /** expr op=('+'|'-') expr */
    @Override
    public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
        int l = visit(ctx.expr(0));  int r = visit(ctx.expr(1));
        if (ctx.op.getType() == LabeledExprParser.ADD) 
            return l + r;
        return l - r; // must be SUB
    }

Example: Calc

EvalVisitor -- parentheses (nested expressions)

    /** '(' expr ')' */
    @Override
    public Integer visitParens(LabeledExprParser.ParensContext ctx) {
        return visit(ctx.expr()); // return child expr's value
    }

Example: Calc

Lastly, a driver to control the whole pipeline

Summary

Resources

Books

Resources

Tools

Credentials

Who is this guy?