See the core parts involved in a language/tool
Examine some of the options available in building those core parts
Discuss some of the tradeoffs and consequences
Encourage you to build your own language!
Why bother learning how compilers are built?
it is commonly considered a "classic" of Computer Science
good craftsman knows his tools; compilers are our tools
compiler construction techniques are useful for other purposes
Consider a simple state machine
we can certainly model this with the State pattern, typically using Event, Transition, and Command objects (see the sketch after this list)
how much code will this occupy?
how much code is required to create/build one?
who maintains it?
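For a sense of scale, here is a minimal State-pattern sketch of a two-state machine; all the names are hypothetical, for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

// A hypothetical two-state turnstile (illustrative, not from any framework)
enum State { LOCKED, UNLOCKED }
enum Event { COIN, PUSH }

class Turnstile {
    // transition table: (current state, event) -> next state
    private final Map<State, Map<Event, State>> transitions = new HashMap<>();
    private State current = State.LOCKED;

    Turnstile() {
        transitions.put(State.LOCKED, Map.of(Event.COIN, State.UNLOCKED));
        transitions.put(State.UNLOCKED, Map.of(Event.PUSH, State.LOCKED));
    }

    void fire(Event e) {
        State next = transitions.getOrDefault(current, Map.of()).get(e);
        if (next != null) current = next;   // ignore events with no transition
    }

    State state() { return current; }
}
```

Even this toy costs two enums, a transition table, and dispatch code; real machines only grow from here.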
Consider the traditional "business rules"
some rules are endemic to the organization and must always be applied (until the business changes)
some rules are temporary and short-lived, due to changes in the economy, laws, business partnerships, etc.
how do we distinguish between the two?
how much code is involved in each case?
who maintains it?
Consider CS problems we haven't solved yet
concurrency
user interface
application security
object-level/fine-grained security
how much code is required to implement this?
how well segregated is it from the "real" problem?
who maintains this?
Within the last few years, "Domain-Specific Languages" have become a new area of inquiry
external DSL: write your own traditional language
internal DSL: write the language "inside" another
write languages for your users!
But let's not abandon lines of general-purpose language research, either
the prevalence of virtual machines makes it approachable
write languages for your own programmers!
Lots of tools are already "languages", "DSLs", or related to languages in some way
build tools
virtual machines
database query engines
code generation tools
output (HTML) template engines
field input and/or data validators
static code analysis tools
"little languages" (sed, awk, regex, ...)
the whole family of XML-related tools
Creating a custom programming language/DSL
lexing and parsing
intermediate representation
analysis
execution and/or code-generation
Lexing and Parsing
Transform input characters into meaning
Lexing: transform input into symbols
Parsing: transform symbols into meaning
End result: Intermediate Representation
Intermediate Representation
Often called an abstract syntax tree (AST)
This format is the halfway-step between source and artifact
Useful for analysis (next) and optimization
Sometimes there may be several "levels"
"High IR": conceptual, abstracted
"Low IR": close to the target output
Analysis
Examining for optimization opportunities
Examining for manipulation (aspect-oriented, for example)
Will sometimes occur repeatedly (phases)
Execution
These are the interpreters
Each node of the IR is executed directly
Allows for some interesting runtime manipulation of IR
Also tends to be slower, with more runtime overhead
Code Generation
These are the compilers
Transform IR into directly-executable output
"Directly executable" = PE, COFF, CLR, JVM
"Directly executable" = JavaScript (transpiling)
Or transpilers
Transform IR into other language source
C, Javascript, assembler are popular targets
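A hedged sketch of the transpiler idea: walk a tiny expression IR (the types here are hypothetical, Java 16+ records for brevity) and emit JavaScript source.

```java
// Walk a tiny expression IR and emit JavaScript source text
interface Expr { String toJs(); }

record Num(double value) implements Expr {
    public String toJs() { return Double.toString(value); }
}

record BinOp(String op, Expr left, Expr right) implements Expr {
    public String toJs() { return "(" + left.toJs() + " " + op + " " + right.toJs() + ")"; }
}

class Transpile {
    public static void main(String[] args) {
        Expr e = new BinOp("+", new Num(1), new BinOp("*", new Num(2), new Num(3)));
        System.out.println("console.log(" + e.toJs() + ");");
        // prints: console.log((1.0 + (2.0 * 3.0)));
    }
}
```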
Lexing is a first-pass verification of the input stream
looking for obvious mistakes
typos
incorrect syntax
malformed strings
... and so on
saves a degree of complexity in the parser
Lexing takes characters and produces "tokens"
input format: characters (or other atoms)
"characters" are all acceptable language characters
lexer verifies they are in "sensical" usage
identifies the kind of usage
output format: stream of recognized "tokens"
keywords
symbols
(potential) identifiers
numerical constants
Lexing examples (Java)
"public ": keyword
"pbulic ": potential identifier
"45.7 ": floating-point constant
"45. ": error!
typically done to avoid re-parsing or direct analysis of the input format
typically the input format is text
... but not always!
... and doesn't have to be!
typically broken into two parts: lexing and parsing
lexing transforms characters/atoms into tokens
parsing verifies that tokens are appearing sanely
lexing errors: input that cannot form valid tokens
yg thgmnoon etwo flai rnubasdr
parsing errors: valid tokens in an order that carries no meaning
yogurt thank moon network fail burn mini
a grammar describes production rules: the possible successful parse steps, usually written top-down
formal grammar is often used as an input format to a parser or parser generator (code generator)
"EBNF" is one such grammar language (and popular)
"shorthand" versions of EBNF are often used in informal descriptions/prose
4-operation calculator grammar (EBNF)
input ::= ws expr ws ;
expr ::= ws factor [{ ws ('+'|'-') ws factor }] ;
factor ::= ws term [{ ws ('*'|'/') ws term }] ;
term ::= '(' ws expr ws ')' | '-' ws expr | number ;
number ::= {dgt} ['.' {dgt}] [('e'|'E') ['-'] {dgt}] ;
dgt ::= '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' ;
ws ::= [{ ' '|'\t'|'\n'|'\r' }] ;
traditional tools are lex/yacc to generate C code, and later GNU versions flex/bison
ANTLR generates C, C++, Java, C#, and others
other parser generator tools abound
for simple input formats, this can be as easy as a regex (see the sketch after this list)
XML is another such example format
for complex input formats, parser combinators work with functional languages to simplify the parsing
Fowler calls this an internal DSL
requires the host language to be extremely flexible in its syntax constructs
or else requires the input format to be similar in format to the host language's input format
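A sketch of the regex end of that spectrum, parsing a hypothetical key=value configuration format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex-as-parser: one pattern per line shape of the input format
class KeyValueParser {
    private static final Pattern LINE = Pattern.compile("(\\w+)\\s*=\\s*(\\S+)");

    public static void main(String[] args) {
        Matcher m = LINE.matcher("timeout = 30\nretries = 5");
        while (m.find())
            System.out.println(m.group(1) + " -> " + m.group(2));
    }
}
```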
Internal Language Translation
Ruby, Groovy, Scala, F#, Boo, ...
XML-Directed Translation
Ant, Spring config, etc
Delimiter-Directed Translation
whitespace or comma-delimited parsing
Syntax-Directed Translation (code-gen)
typically created using parser generator toolkit (ANTLR, Gold, lex/yacc, etc)
Parser Combinators/Parser Expression Grammars
functional languages excel here
Abstract Syntax Tree (AST)
tree of elements making up internal representation of program structure
AST serves as the base for all operations:
verify correctness of structure
apply optimizations
apply transformations/manipulations
Intermediate Representation (IR) is a form of the input optimized for analysis and transformation
frequently the AST is transformed into an IR as an optimization
IR can be tied closely to the final desired output
IR can be simple tree structures ("nodes") annotated with additional metadata/information about the nodes
some environments actually move through multiple IR formats during the process
start with a "High IR" extremely abstract and/or conceptual in nature, then move to a "Low IR" that more closely matches the final result format
if it's not in the IR, it doesn't exist
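A minimal sketch of such a nodes-plus-metadata IR (hypothetical types, for illustration only):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One IR node: a kind tag, children, and a bag of annotations
class IRNode {
    final String kind;                        // e.g. "add", "const", "load"
    final List<IRNode> children = new ArrayList<>();
    final Map<String, Object> metadata = new HashMap<>();  // types, source positions, ...

    IRNode(String kind) { this.kind = kind; }
}
```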
Once the input form is in an IR, optimizations can be made to improve the code's efficiency at a variety of levels
constant propagation
loop unrolling
inline code expansion
semantic macros
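For flavor, a constant-folding pass (a close cousin of constant propagation) over the IRNode sketch above; the "kind" names are hypothetical:

```java
// Bottom-up constant folding: replace an "add" of all-"const" children
// with a single "const" node holding the computed sum
class ConstFold {
    static IRNode fold(IRNode n) {
        n.children.replaceAll(ConstFold::fold);   // fold children first
        boolean allConst = n.kind.equals("add")
                && !n.children.isEmpty()
                && n.children.stream().allMatch(c -> c.kind.equals("const"));
        if (allConst) {
            int sum = n.children.stream()
                    .mapToInt(c -> (Integer) c.metadata.get("value")).sum();
            IRNode folded = new IRNode("const");
            folded.metadata.put("value", sum);
            return folded;
        }
        return n;
    }
}
```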
IR structures can be executed/interpreted by a host environment
typically this code must be written from scratch
notable exceptions:
dynalang (JVM)
DLR (CLR)
If the input format is simple enough, a syntax-directed interpreter can execute code directly off of the parsed input (no IR)
Interpreter
interpreter is usually for dynamic languages or languages without a compiled form
frequently easier to implement
often driven directly off the AST
sometimes/often lower performance than compiled/code-genned equivalent
Produce code output for execution
bytecode for a particular virtual machine
JVM, CLR, LLVM, Parrot, ...
native code for a particular CPU/environment
language source for portability or bootstrapping
"transpilation"
C is/was a popular choice for this
JavaScript is everybody's favorite target today
some early JVM languages generated Java
get halfway to native code by generating CPU asm
Code generation implementation tools
CCI: Common Compiler Infrastructure (CLR)
Truffle (Graal/JVM)
Soot (JVM)
StringTemplate (ANTLR)
for string-based templatized generation
ASM, BCEL bytecode toolkits (JVM)
Cecil, PERWAPI bytecode toolkits (CLR)
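A small sketch using ASM's visitor-style API (JVM) to emit a class whose main() prints a constant; the class name and message are illustrative:

```java
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import static org.objectweb.asm.Opcodes.*;

// Generate the bytes of a "Hello" class equivalent to:
//   public class Hello { public static void main(String[] a) { System.out.println(...); } }
class GenHello {
    static byte[] generate() {
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
        cw.visit(V1_8, ACC_PUBLIC, "Hello", null, "java/lang/Object", null);
        MethodVisitor mv = cw.visitMethod(ACC_PUBLIC | ACC_STATIC, "main",
                "([Ljava/lang/String;)V", null, null);
        mv.visitCode();
        mv.visitFieldInsn(GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        mv.visitLdcInsn("hello from generated bytecode");
        mv.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream", "println",
                "(Ljava/lang/String;)V", false);
        mv.visitInsn(RETURN);
        mv.visitMaxs(0, 0);   // recomputed because of COMPUTE_FRAMES
        mv.visitEnd();
        cw.visitEnd();
        return cw.toByteArray();
    }
}
```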
Example: A calculator language (Calc)
simple four-operation calculator
Java implementation
ANTLR4 grammar
simple variables support
interpreted (not compiled)
Example calculator script
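A script consistent with the results shown next, modeled on the t.expr example in The Definitive ANTLR 4 Reference (which this Calc demo follows):

```
193
a = 5
b = 6
a+b*2
(1+2)*3
```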
Example results
193
17
9
Lexing
we have simple lexing requirements:
numbers
characters (for variables)
math operation characters (+, -, *, /)
parentheses (for order of operations)
ANTLR combines lexer and parser into one .g4 file
Labeled Expressions grammar
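The lexer half, reconstructed from the LabeledExpr example in The Definitive ANTLR 4 Reference:

```antlr
grammar LabeledExpr;            // combined lexer+parser grammar

// (parser rules appear with the next slide)

MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
ID  : [a-zA-Z]+ ;               // identifier names (variables)
INT : [0-9]+ ;                  // integer constants
NEWLINE : '\r'? '\n' ;          // newlines terminate statements
WS  : [ \t]+ -> skip ;          // discard spaces and tabs
```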
Parsing
a calc 'program' is made up of a series of statements
a statement is either a mathematical expression...
... or an identifier statement/assignment
expressions are mathematical ops (and can nest)
Labeled Expressions grammar
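The parser half, from the same reconstruction; the # labels name each alternative so ANTLR generates one visit method per label:

```antlr
prog: stat+ ;                           // a program is one or more statements

stat: expr NEWLINE         # printExpr  // bare expression: evaluate and print
    | ID '=' expr NEWLINE  # assign     // assignment statement
    | NEWLINE              # blank
    ;

expr: expr op=('*'|'/') expr # MulDiv   // listed first, so it binds tighter
    | expr op=('+'|'-') expr # AddSub
    | INT                    # int
    | ID                     # id
    | '(' expr ')'           # parens   // nesting via parentheses
    ;
```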
AST
in this case, we can simply reuse the ANTLR-generated tree
ANTLR generates a Visitor-based implementation we can use
for future versions, we can extend the AST
add additional type information
or use the AST as source for a different representation
something like an IR (either a "high IR" or "low IR")
Interpretation
ANTLR provides a Visitor interface/implementation
generated based on the grammar
it visits each node of the parse tree/AST
context is kept by way of a HashMap of named values
these are the identifiers and their running data
in some languages, this is the "Environment"
each call corresponds to a node in the tree
each call receives a "context" referencing that node
containing information from the parse for that node
EvalVisitor -- an interpreter
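The visitor's skeleton, reconstructed from the book's EvalVisitor example:

```java
import java.util.HashMap;
import java.util.Map;

public class EvalVisitor extends LabeledExprBaseVisitor<Integer> {
    // the "Environment": identifier -> current value
    Map<String, Integer> memory = new HashMap<>();

    // visit methods (following slides) go here
}
```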
EvalVisitor -- assignment
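From the same reconstruction:

```java
// ID '=' expr NEWLINE   # assign
@Override
public Integer visitAssign(LabeledExprParser.AssignContext ctx) {
    String id = ctx.ID().getText();  // the identifier being assigned
    int value = visit(ctx.expr());   // evaluate the right-hand side
    memory.put(id, value);           // remember it in the Environment
    return value;
}
```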
EvalVisitor -- printing expressions
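Again from the reconstruction:

```java
// expr NEWLINE   # printExpr
@Override
public Integer visitPrintExpr(LabeledExprParser.PrintExprContext ctx) {
    Integer value = visit(ctx.expr());  // evaluate the expression child
    System.out.println(value);          // print the result
    return 0;                           // dummy value; nothing meaningful to return
}
```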
EvalVisitor -- constants and identifiers
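The leaf nodes, per the same sketch:

```java
// INT   # int
@Override
public Integer visitInt(LabeledExprParser.IntContext ctx) {
    return Integer.valueOf(ctx.INT().getText());
}

// ID   # id
@Override
public Integer visitId(LabeledExprParser.IdContext ctx) {
    String id = ctx.ID().getText();
    if (memory.containsKey(id)) return memory.get(id);
    return 0;   // unbound identifiers default to 0
}
```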
EvalVisitor -- mulDiv and addSub nodes
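The binary operators, per the same sketch:

```java
// expr op=('*'|'/') expr   # MulDiv
@Override
public Integer visitMulDiv(LabeledExprParser.MulDivContext ctx) {
    int left = visit(ctx.expr(0));   // evaluate the left subexpression
    int right = visit(ctx.expr(1));  // evaluate the right subexpression
    if (ctx.op.getType() == LabeledExprParser.MUL) return left * right;
    return left / right;             // must be DIV
}

// expr op=('+'|'-') expr   # AddSub
@Override
public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
    int left = visit(ctx.expr(0));
    int right = visit(ctx.expr(1));
    if (ctx.op.getType() == LabeledExprParser.ADD) return left + right;
    return left - right;             // must be SUB
}
```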
EvalVisitor -- parentheses (nested expressions)
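And the simplest case, per the same sketch:

```java
// '(' expr ')'   # parens
@Override
public Integer visitParens(LabeledExprParser.ParensContext ctx) {
    return visit(ctx.expr());   // parentheses only group; just evaluate the inside
}
```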
Lastly, a driver to control the whole pipeline
construct the pipeline
ANTLRInputStream -- the source
LabeledExprLexer -- the (generated) lexer
CommonTokenStream -- the token buffer between lexer and parser
LabeledExprParser -- the (generated) parser
ParseTree -- the tree built by the parser
find the file, load it, feed it, visit the result
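A driver sketch following that recipe; the LabeledExpr class names match the generated artifacts, and ANTLRInputStream is the classic (pre-CharStreams) style the book uses:

```java
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;
import java.io.FileInputStream;
import java.io.InputStream;

public class Calc {
    public static void main(String[] args) throws Exception {
        // find the file (or fall back to stdin) and load it
        InputStream is = args.length > 0 ? new FileInputStream(args[0]) : System.in;
        ANTLRInputStream input = new ANTLRInputStream(is);        // the source
        LabeledExprLexer lexer = new LabeledExprLexer(input);     // generated lexer
        CommonTokenStream tokens = new CommonTokenStream(lexer);  // token buffer
        LabeledExprParser parser = new LabeledExprParser(tokens); // generated parser
        ParseTree tree = parser.prog();    // parse, starting at the prog rule
        new EvalVisitor().visit(tree);     // visit the result: interpret the tree
    }
}
```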
Building a language need not be hard:
getting the parser in place is usually the starting point
if it's a simple interpreter, that can be sufficient
if you want a moderately complex language, convert the parse tree into an IR
if you want compiled code, transform the IR into executable bytecode
Lots of tools can be seen as "languages"
bytecode analysis is just parsing bytecode-as-an-input
a virtual machine parses a binary input form and either generates executable code (possibly on the fly) or interprets it directly
Books
The Definitive ANTLR 4 Reference (Parr)
Language Implementation Patterns (Parr)
Programming Language Pragmatics (Scott)
Compilers (2/E) (Aho, Lam, Sethi, Ullman)
"The Dragon Book", a classic
Tools
ANTLR
http://www.antlr.org
http://tech.puredanger.com/2007/01/13/implementing-a-scripting-language-with-antlr-part-1-lexer/
lex, yacc, SableCC, JLex/JYacc, others
LLVM
Who is this guy?
Architect, Engineering Manager/Leader, "force multiplier"
Principal -- Neward & Associates
http://www.newardassociates.com
Educative (http://educative.io) Author
Performance Management for Engineering Managers
Author
Professional F# 2.0 (w/Erickson, et al; Wrox, 2010)
Effective Enterprise Java (Addison-Wesley, 2004)
SSCLI Essentials (w/Stutz, et al; O'Reilly, 2003)
Server-Based Java Programming (Manning, 2000)