See the core parts involved in a language/tool
Examine some of the options available in building those core parts
Discuss some of the tradeoffs and consequences
Encourage you to build your own language!
Why bother learning how compilers are built?
it is commonly considered a "classic" of Computer Science
good craftsman knows his tools; compilers are our tools
compiler construction techniques are useful for other purposes
Consider a simple state machine
we can certainly model this with the State pattern, typically using Event, Transition, and Command objects (see the sketch after this list)
how much code will this occupy?
how much code is required to create/build one?
who maintains it?
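For a sense of scale, here is a minimal State-pattern sketch of a two-state machine; all the names are hypothetical, for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

// A hypothetical two-state turnstile (illustrative, not from any framework)
enum State { LOCKED, UNLOCKED }
enum Event { COIN, PUSH }

class Turnstile {
    // transition table: (current state, event) -> next state
    private final Map<State, Map<Event, State>> transitions = new HashMap<>();
    private State current = State.LOCKED;

    Turnstile() {
        transitions.put(State.LOCKED, Map.of(Event.COIN, State.UNLOCKED));
        transitions.put(State.UNLOCKED, Map.of(Event.PUSH, State.LOCKED));
    }

    void fire(Event e) {
        State next = transitions.getOrDefault(current, Map.of()).get(e);
        if (next != null) current = next;   // ignore events with no transition
    }

    State state() { return current; }
}
```

Even this toy costs two enums, a transition table, and dispatch code; real machines only grow from here.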
Consider the traditional "business rules"
some rules are endemic to the organization and must always be applied (until the business changes)
some rules are temporary and short-lived, due to changes in the economy, laws, business partnerships, etc.
how do we distinguish between the two?
how much code is involved in each case?
who maintains it?
Consider CS problems we haven't solved yet
concurrency
user interface
application security
object-level/fine-grained security
how much code is required to implement this?
how well segregated is it from the "real" problem?
who maintains this?
Within the last few years, "Domain-Specific Languages" have become a new area of inquiry
external DSL: write your own traditional language
internal DSL: write the language "inside" another
write languages for your users!
But let's not abandon lines of general-purpose language research, either
the prevalence of virtual machines makes it approachable
write languages for your own programmers!
Lots of tools are already "languages", "DSLs", or related to languages in some way
build tools
virtual machines
database query engines
code generation tools
output (HTML) template engines
field input and/or data validators
static code analysis tools
"little languages" (sed, awk, regex, ...)
the whole family of XML-related tools
Creating a custom programming language/DSL
lexing and parsing
intermediate representation
analysis
execution and/or code-generation
Lexing and Parsing
Transform input characters into meaning
Lexing: transform input into symbols
Parsing: transform symbols into meaning
End result: Intermediate Representation
Intermediate Representation
Often called an abstract syntax tree (AST)
This format is the halfway-step between source and artifact
Useful for analysis (next) and optimization
Sometimes there may be several "levels"
"High IR": conceptual, abstracted
"Low IR": close to the target output
Analysis
Examining for optimization opportunities
Examining for manipulation (aspect-oriented, for example)
Will sometimes occur repeatedly (phases)
Execution
These are the interpreters
Each node of the IR is executed directly
Allows for some interesting runtime manipulation of IR
Also tends to be slower, with more runtime overhead
Code Generation
These are the compilers
Transform IR into directly-executable output
"Directly executable" = PE, COFF, CLR, JVM
"Directly executable" = JavaScript (transpiling)
Or transpilers
Transform IR into other language source
C, Javascript, assembler are popular targets
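A hedged sketch of the transpiler idea: walk a tiny expression IR (the types here are hypothetical, Java 16+ records for brevity) and emit JavaScript source.

```java
// Walk a tiny expression IR and emit JavaScript source text
interface Expr { String toJs(); }

record Num(double value) implements Expr {
    public String toJs() { return Double.toString(value); }
}

record BinOp(String op, Expr left, Expr right) implements Expr {
    public String toJs() { return "(" + left.toJs() + " " + op + " " + right.toJs() + ")"; }
}

class Transpile {
    public static void main(String[] args) {
        Expr e = new BinOp("+", new Num(1), new BinOp("*", new Num(2), new Num(3)));
        System.out.println("console.log(" + e.toJs() + ");");
        // prints: console.log((1.0 + (2.0 * 3.0)));
    }
}
```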
Lexing is a first-pass verification of the input stream
looking for obvious mistakes
typos
incorrect syntax
malformed strings
... and so on
saves a degree of complexity in the parser
Lexing takes characters and produces "tokens"
input format: characters (or other atoms)
"characters" are all acceptable language characters
lexer verifies they are in "sensical" usage
identifies the kind of usage
output format: stream of recognized "tokens"
keywords
symbols
(potential) identifiers
numerical constants
Lexing examples (Java)
"public ": keyword
"pbulic ": potential identifier
"45.7 ": floating-point constant
"45. ": error!
typically done to avoid re-parsing or direct analysis of the input format
typically the input format is text
... but not always!
... and doesn't have to be!
typically broken into two parts: lexing and parsing
lexing transforms characters/atoms into tokens
parsing verifies that tokens are appearing sanely
lexing errors: input that cannot form valid tokens
yg thgmnoon etwo flai rnubasdr
parsing errors: valid tokens in an order that carries no meaning
yogurt thank moon network fail burn mini
a grammar describes production rules: the possible successful parse steps, usually written top-down
formal grammar is often used as an input format to a parser or parser generator (code generator)
"EBNF" is one such grammar language (and popular)
"shorthand" versions of EBNF are often used in informal descriptions/prose
4-operation calculator grammar (EBNF)
input ::= ws expr ws ;
expr ::= ws factor [{ ws ('+'|'-') ws factor }] ;
factor ::= ws term [{ ws ('*'|'/') ws term }] ;
term ::= '(' ws expr ws ')' | '-' ws expr | number ;
number ::= {dgt} ['.' {dgt}] [('e'|'E') ['-'] {dgt}] ;
dgt ::= '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' ;
ws ::= [{ ' '|'\t'|'\n'|'\r' }] ;
traditional tools are lex/yacc to generate C code, and later GNU versions flex/bison
ANTLR generates C, C++, Java, C#, and others
other parser generator tools abound
for simple input formats, this can be as easy as a regex (see the sketch after this list)
XML is another such example format
for complex input formats, parser combinators work with functional languages to simplify the parsing
Fowler calls this an internal DSL
requires the host language to be extremely flexible in its syntax constructs
or else requires the input format to be similar in format to the host language's input format
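A sketch of the regex end of that spectrum, parsing a hypothetical key=value configuration format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex-as-parser: one pattern per line shape of the input format
class KeyValueParser {
    private static final Pattern LINE = Pattern.compile("(\\w+)\\s*=\\s*(\\S+)");

    public static void main(String[] args) {
        Matcher m = LINE.matcher("timeout = 30\nretries = 5");
        while (m.find())
            System.out.println(m.group(1) + " -> " + m.group(2));
    }
}
```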
Internal Language Translation
Ruby, Groovy, Scala, F#, Boo, ...
XML-Directed Translation
Ant, Spring config, etc
Delimiter-Directed Translation
whitespace or comma-delimited parsing
Syntax-Directed Translation (code-gen)
typically created using parser generator toolkit (ANTLR, Gold, lex/yacc, etc)
Parser Combinators/Parser Expression Grammars
functional languages excel here
Abstract Syntax Tree (AST)
tree of elements making up internal representation of program structure
AST serves as the base for all operations:
verify correctness of structure
apply optimizations
apply transformations/manipulations
Intermediate Representation (IR) is a form of the input optimized for analysis and transformation
frequently the AST is transformed into an IR as an optimization
IR can be tied closely to the final desired output
IR can be simple tree structures ("nodes") annotated with additional metadata/information about the nodes
some environments actually move through multiple IR formats during the process
start with a "High IR" extremely abstract and/or conceptual in nature, then move to a "Low IR" that more closely matches the final result format
if it's not in the IR, it doesn't exist
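A minimal sketch of such a nodes-plus-metadata IR (hypothetical types, for illustration only):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One IR node: a kind tag, children, and a bag of annotations
class IRNode {
    final String kind;                        // e.g. "add", "const", "load"
    final List<IRNode> children = new ArrayList<>();
    final Map<String, Object> metadata = new HashMap<>();  // types, source positions, ...

    IRNode(String kind) { this.kind = kind; }
}
```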
Once the input form is in an IR, optimizations can be made to improve the code's efficiency at a variety of levels
constant propagation
loop unrolling
inline code expansion
semantic macros
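For flavor, a constant-folding pass (a close cousin of constant propagation) over the IRNode sketch above; the "kind" names are hypothetical:

```java
// Bottom-up constant folding: replace an "add" of all-"const" children
// with a single "const" node holding the computed sum
class ConstFold {
    static IRNode fold(IRNode n) {
        n.children.replaceAll(ConstFold::fold);   // fold children first
        boolean allConst = n.kind.equals("add")
                && !n.children.isEmpty()
                && n.children.stream().allMatch(c -> c.kind.equals("const"));
        if (allConst) {
            int sum = n.children.stream()
                    .mapToInt(c -> (Integer) c.metadata.get("value")).sum();
            IRNode folded = new IRNode("const");
            folded.metadata.put("value", sum);
            return folded;
        }
        return n;
    }
}
```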
IR structures can be executed/interpreted by a host environment
typically this code must be written from scratch
notable exceptions:
dynalang (JVM)
DLR (CLR)
If the input format is simple enough, a syntax-directed interpreter can execute code directly off of the parsed input (no IR)
Interpreter
interpreter is usually for dynamic languages or languages without a compiled form
frequently easier to implement
often driven directly off the AST
sometimes/often lower performance than compiled/code-genned equivalent
Produce code output for execution
bytecode for a particular virtual machine
JVM, CLR, LLVM, Parrot, ...
native code for a particular CPU/environment
language source for portability or bootstrapping
"transpilation"
C is/was a popular choice for this
JavaScript is everybody's favorite target today
some early JVM languages generated Java
get halfway to native code by generating CPU asm
Code generation implementation tools
CCI: Common Compiler Infrastructure (CLR)
Truffle (Graal/JVM)
Soot (JVM)
StringTemplate (ANTLR)
for string-based templatized generation
ASM, BCEL bytecode toolkits (JVM)
Cecil, PERWAPI bytecode toolkits (CLR)
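A small sketch using ASM's visitor-style API (JVM) to emit a class whose main() prints a constant; the class name and message are illustrative:

```java
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import static org.objectweb.asm.Opcodes.*;

// Generate the bytes of a "Hello" class equivalent to:
//   public class Hello { public static void main(String[] a) { System.out.println(...); } }
class GenHello {
    static byte[] generate() {
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
        cw.visit(V1_8, ACC_PUBLIC, "Hello", null, "java/lang/Object", null);
        MethodVisitor mv = cw.visitMethod(ACC_PUBLIC | ACC_STATIC, "main",
                "([Ljava/lang/String;)V", null, null);
        mv.visitCode();
        mv.visitFieldInsn(GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        mv.visitLdcInsn("hello from generated bytecode");
        mv.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream", "println",
                "(Ljava/lang/String;)V", false);
        mv.visitInsn(RETURN);
        mv.visitMaxs(0, 0);   // recomputed because of COMPUTE_FRAMES
        mv.visitEnd();
        cw.visitEnd();
        return cw.toByteArray();
    }
}
```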
Example: A calculator language (Calc)
simple four-operation calculator
Java implementation
ANTLR4 grammar
simple variables support
interpreted (not compiled)
Example calculator script
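A script consistent with the results shown next, modeled on the t.expr example in The Definitive ANTLR 4 Reference (which this Calc demo follows):

```
193
a = 5
b = 6
a+b*2
(1+2)*3
```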
Example results
193
17
9
Lexing
we have simple lexing requirements:
numbers
characters (for variables)
math operation characters (+, -, *, /)
parentheses (for order of operations)
ANTLR combines lexer and parser into one .g4 file
Labeled Expressions grammar
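The lexer half, reconstructed from the LabeledExpr example in The Definitive ANTLR 4 Reference:

```antlr
grammar LabeledExpr;            // combined lexer+parser grammar

// (parser rules appear with the next slide)

MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
ID  : [a-zA-Z]+ ;               // identifier names (variables)
INT : [0-9]+ ;                  // integer constants
NEWLINE : '\r'? '\n' ;          // newlines terminate statements
WS  : [ \t]+ -> skip ;          // discard spaces and tabs
```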
Parsing
a calc 'program' is made up of a series of statements
a statement is either a mathematical expression...
... or an identifier statement/assignment
expressions are mathematical ops (and can nest)
Labeled Expressions grammar
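The parser half, from the same reconstruction; the # labels name each alternative so ANTLR generates one visit method per label:

```antlr
prog: stat+ ;                           // a program is one or more statements

stat: expr NEWLINE         # printExpr  // bare expression: evaluate and print
    | ID '=' expr NEWLINE  # assign     // assignment statement
    | NEWLINE              # blank
    ;

expr: expr op=('*'|'/') expr # MulDiv   // listed first, so it binds tighter
    | expr op=('+'|'-') expr # AddSub
    | INT                    # int
    | ID                     # id
    | '(' expr ')'           # parens   // nesting via parentheses
    ;
```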
AST
in this case, we can simply reuse the ANTLR-generated tree
ANTLR generates a Visitor-based implementation we can use
for future versions, we can extend the AST
add additional type information
or use the AST as source for a different representation
something like an IR (either a "high IR" or "low IR")
Interpretation
ANTLR provides a Visitor interface/implementation
generated based on the grammar
it visits each node of the parse tree/AST
context is kept by way of a HashMap of named values
these are the identifiers and their running data
in some languages, this is the "Environment"
each call corresponds to a node in the tree
each call receives a "context" referencing that node
containing information from the parse for that node
EvalVisitor -- an interpreter
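The visitor's skeleton, reconstructed from the book's EvalVisitor example:

```java
import java.util.HashMap;
import java.util.Map;

public class EvalVisitor extends LabeledExprBaseVisitor<Integer> {
    // the "Environment": identifier -> current value
    Map<String, Integer> memory = new HashMap<>();

    // visit methods (following slides) go here
}
```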
EvalVisitor -- assignment
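From the same reconstruction:

```java
// ID '=' expr NEWLINE   # assign
@Override
public Integer visitAssign(LabeledExprParser.AssignContext ctx) {
    String id = ctx.ID().getText();  // the identifier being assigned
    int value = visit(ctx.expr());   // evaluate the right-hand side
    memory.put(id, value);           // remember it in the Environment
    return value;
}
```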
EvalVisitor -- printing expressions
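Again from the reconstruction:

```java
// expr NEWLINE   # printExpr
@Override
public Integer visitPrintExpr(LabeledExprParser.PrintExprContext ctx) {
    Integer value = visit(ctx.expr());  // evaluate the expression child
    System.out.println(value);          // print the result
    return 0;                           // dummy value; nothing meaningful to return
}
```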
EvalVisitor -- constants and identifiers
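The leaf nodes, per the same sketch:

```java
// INT   # int
@Override
public Integer visitInt(LabeledExprParser.IntContext ctx) {
    return Integer.valueOf(ctx.INT().getText());
}

// ID   # id
@Override
public Integer visitId(LabeledExprParser.IdContext ctx) {
    String id = ctx.ID().getText();
    if (memory.containsKey(id)) return memory.get(id);
    return 0;   // unbound identifiers default to 0
}
```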
EvalVisitor -- mulDiv and addSub nodes
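The binary operators, per the same sketch:

```java
// expr op=('*'|'/') expr   # MulDiv
@Override
public Integer visitMulDiv(LabeledExprParser.MulDivContext ctx) {
    int left = visit(ctx.expr(0));   // evaluate the left subexpression
    int right = visit(ctx.expr(1));  // evaluate the right subexpression
    if (ctx.op.getType() == LabeledExprParser.MUL) return left * right;
    return left / right;             // must be DIV
}

// expr op=('+'|'-') expr   # AddSub
@Override
public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
    int left = visit(ctx.expr(0));
    int right = visit(ctx.expr(1));
    if (ctx.op.getType() == LabeledExprParser.ADD) return left + right;
    return left - right;             // must be SUB
}
```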
EvalVisitor -- parentheses (nested expressions)
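And the simplest case, per the same sketch:

```java
// '(' expr ')'   # parens
@Override
public Integer visitParens(LabeledExprParser.ParensContext ctx) {
    return visit(ctx.expr());   // parentheses only group; just evaluate the inside
}
```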
Lastly, a driver to control the whole pipeline
construct the pipeline
ANTLRInputStream -- the source
LabeledExprLexer -- the (generated) lexer
CommonTokenStream -- the token buffer between lexer and parser
LabeledExprParser -- the (generated) parser
ParseTree -- the tree built by the parser
find the file, load it, feed it, visit the result
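A driver sketch following that recipe; the LabeledExpr class names match the generated artifacts, and ANTLRInputStream is the classic (pre-CharStreams) style the book uses:

```java
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;
import java.io.FileInputStream;
import java.io.InputStream;

public class Calc {
    public static void main(String[] args) throws Exception {
        // find the file (or fall back to stdin) and load it
        InputStream is = args.length > 0 ? new FileInputStream(args[0]) : System.in;
        ANTLRInputStream input = new ANTLRInputStream(is);        // the source
        LabeledExprLexer lexer = new LabeledExprLexer(input);     // generated lexer
        CommonTokenStream tokens = new CommonTokenStream(lexer);  // token buffer
        LabeledExprParser parser = new LabeledExprParser(tokens); // generated parser
        ParseTree tree = parser.prog();    // parse, starting at the prog rule
        new EvalVisitor().visit(tree);     // visit the result: interpret the tree
    }
}
```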
Building a language need not be hard:
getting the parser in place is usually the starting point
if it's a simple interpreter, that can be sufficient
if you want a moderately complex language, convert the parse tree into an IR
if you want compiled code, transform the IR into executable bytecode
Lots of tools can be seen as "languages"
bytecode analysis is just parsing bytecode-as-an-input
a virtual machine parses a binary input form and either generates executable code (possibly on the fly) or interprets it directly
Books
The Definitive ANTLR 4 Reference (Parr)
Language Implementation Patterns (Parr)
Programming Language Pragmatics (Scott)
Compilers (2/E) (Aho, Lam, Sethi, Ullman)
"The Dragon Book", a classic
Tools
ANTLR
http://www.antlr.org
http://tech.puredanger.com/2007/01/13/implementing-a-scripting-language-with-antlr-part-1-lexer/
lex, yacc, SableCC, JLex/JYacc, others
LLVM
Who is this guy?
Architect, Engineering Manager/Leader, "force multiplier"
Principal -- Neward & Associates
http://www.newardassociates.com
Educative (http://educative.io) Author
Performance Management for Engineering Managers
Author
Professional F# 2.0 (w/Erickson, et al; Wrox, 2010)
Effective Enterprise Java (Addison-Wesley, 2004)
SSCLI Essentials (w/Stutz, et al; O'Reilly, 2003)
Server-Based Java Programming (Manning, 2000)