This is a book that teaches you to make an interpreter for a programming language. It seemed fun, and I believe I would greatly benefit from learning how languages are made. This page will contain stuff that I find particularly insightful.
I will mostly avoid copying the book's code in here, beyond small illustrative sketches, as this is not the best platform for it. I might put all of this on GitHub as I progress through it.
GitHub link: https://github.com/1shanra1/Lox/tree/master
### My motivations behind reading this book
- Understand core CS concepts intuitively
- Be prepared in case I am ever asked to make my own scripting language (the chances of that are extremely slim, but small ad hoc languages come up surprisingly often, so I might as well be prepared)
- Have fun (making a programming language is hard and being successful in this endeavour has to be extremely rewarding)
- Compiler courses are among the hardest courses that you can take as an undergraduate student. While I don't plan on taking them, I should know what they cover.
- The book mentions that we'll write our own implementations of dynamic arrays and hash tables, understand how objects are represented in memory, and build our own garbage collector. Sounds like a lot of fun. There will also be some sections on benchmarking and optimisation.
# Factual Content
- **YACC**
- Tool that takes in a grammar file and produces a source file for a compiler, so it is like a compiler that outputs a compiler.
- Stands for "Yet Another Compiler-Compiler"
- **Self hosting**
- Compiler reads files in one language, translates them, and outputs files in another language. You can implement a compiler in any language, including the same language it compiles. This is known as self hosting.
- GCC and LLVM are written in C++, as are most JS VMs.
- Language implementations written in C: Lua, CPython, Ruby's MRI.
- JS used to be the only way to execute code in a browser. Now you also have the option of WebAssembly
- Perl, PHP and Python all started with reference counting as their main memory management technique, but later added or moved to tracing garbage collection, because reference counting cannot reclaim objects that reference each other in a cycle
- Peter J. Landin coined the term "closure" (along with a remarkable share of programming-language vocabulary). Many of those terms come out of one incredible paper, ["The Next 700 Programming Languages"](https://www.cs.cmu.edu/~crary/819-f09/Landin66.pdf)
- **Waterbed Theory, Larry Wall, inventor of Perl**
- Some complexity is essential and cannot be eliminated. If you push it down in one place, it swells up in another.
- Lox, the language that we build in this book, is not a *pure OOP language*. In true OOP, every object is an instance of a class, even primitive values like numbers and Booleans.
- Lex, a tool that helps you go from regexes to a scanner, was created by Mike Lesk and Eric Schmidt (yes, that Eric Schmidt)
# Parts of a language
#### Scanning
Takes in a linear stream of characters and chunks them together into words. Each of these words is called a token.
#### Parsing
- This step is where the syntax gets a grammar.
- The parser takes the flat sequence of tokens and builds a tree structure that mirrors the nested nature of the grammar. These trees are known as a parse tree or an abstract syntax tree (AST); see the sketch after this list.
- Many of the techniques used today to parse programming languages were originally conceived to parse human language by AI researchers trying to get computers to talk to us.
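To make the tree idea concrete, here's a minimal sketch in Java (the language the book uses for its first interpreter). The class names are my own simplification, not the book's generated AST classes:
```
// A minimal sketch of AST nodes for an expression grammar.
abstract class Expr {}

class Literal extends Expr {
    final Object value;
    Literal(Object value) { this.value = value; }
}

class Binary extends Expr {
    final Expr left;
    final String operator;
    final Expr right;
    Binary(Expr left, String operator, Expr right) {
        this.left = left;
        this.operator = operator;
        this.right = right;
    }
}

class AstDemo {
    public static void main(String[] args) {
        // "1 + 2 * 3" parses as 1 + (2 * 3): the tree encodes the
        // precedence that the flat token sequence leaves implicit.
        // Later stages walk this tree.
        Expr tree = new Binary(new Literal(1.0), "+",
                new Binary(new Literal(2.0), "*", new Literal(3.0)));
    }
}
```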
#### Static Analysis
- Binding/Resolution: for each identifier, we find out where that name is defined and wire the two together. Scope comes into play here.
- All the semantic insight that we have gained has to be stored somewhere. There are multiple ways to store this information:
- As attributes within the tree structure
- In a lookup table (known as the symbol table) where the identifiers are the keys; see the sketch after this list
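A hedged sketch of the symbol-table idea, with scopes chained so resolution can walk outward. The structure is my illustration; a real analyzer stores richer info than a type-name string:
```
import java.util.HashMap;
import java.util.Map;

// A bare-bones symbol table: identifiers map to whatever semantic
// info the analyzer wants to remember (here, just a type name).
class SymbolTable {
    private final Map<String, String> symbols = new HashMap<>();
    private final SymbolTable enclosing; // parent scope, or null at top level

    SymbolTable(SymbolTable enclosing) { this.enclosing = enclosing; }

    void define(String name, String info) { symbols.put(name, info); }

    // Resolution walks outward through enclosing scopes.
    String resolve(String name) {
        if (symbols.containsKey(name)) return symbols.get(name);
        if (enclosing != null) return enclosing.resolve(name);
        return null; // unresolved identifier -> report an error
    }
}

class ScopeDemo {
    public static void main(String[] args) {
        SymbolTable global = new SymbolTable(null);
        global.define("language", "string");
        SymbolTable block = new SymbolTable(global);
        System.out.println(block.resolve("language")); // found via enclosing scope
    }
}
```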
#### Intermediate representation
- Think of the compiler as a pipeline where each stage's job is to organize the data representing the user's code in a way that makes the next stage simpler to implement. In the middle, the code might be stored in some "intermediate representation"
- There are a few established styles of intermediate representations
- Control flow graph
- Static single assignment
- Continuation passing style
- Three-address code (see the example after this list)
- The intermediate representation acts as the interface between the source language and the destination language
- This is all a very complicated way of saying that this entire process is like a composite function
- $y = f(g(x))$
- There can be a lot more functions applied before the end result is achieved
- This intermediate representation layer lets you support multiple source languages and target platforms.
- An example would be writing compilers for Pascal, C and Fortran and targeting x86, ARM, and one more platform. Without a shared layer you would have to write 9 full compilers; a shared intermediate representation means writing three front ends plus three back ends instead.
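To make one of those styles concrete, here is roughly what three-address code could look like for a single statement. This is illustrative notation, not any particular compiler's output; the temporaries `t1` and `t2` are invented:
```
// source statement:
a = b + c * d

// one possible three-address lowering
// (each instruction has at most one operator):
t1 = c * d
t2 = b + t1
a  = t2
```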
#### Optimization
- Once we understand what the user's program means, we can swap it out for a different program that has the same semantics but runs faster (a small example follows this list).
- Some optimization keywords to look into:
- Constant propagation
- Common subexpression elimination
- Loop invariant code motion
- Global value numbering
- Strength reduction
- Scalar replacement of aggregates
- Dead code elimination
- Loop unrolling
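A small, hedged illustration of the first keyword (constant propagation, feeding constant folding), written as Java-flavoured before/after methods:
```
class ConstantFolding {
    // Before optimization: the compiler knows both operands at compile time.
    static int before() {
        int minutes = 60;
        return minutes * 60; // propagation turns this into 60 * 60
    }

    // After propagation and folding, it behaves as if we had written:
    static int after() {
        return 3600;
    }
}
```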
#### Code generation
- After applying all sorts of optimizations, we convert the user's program to machine code. This is the backend. From here on out, the process is like evolution run in reverse. We go from sophistication to something extremely primitive.
- We have two paths: do we generate instructions for a real CPU or a virtual one?
- Generating real machine code means that we get an executable that the OS can load directly onto the chip. Generating it is a lot of work, and the output is tied to a single architecture, which hurts portability.
- To get around the portability issue, Martin Richards and Niklaus Wirth made their compilers produce virtual machine code. Instead of producing instructions for some real chip, they produced code for a hypothetical, idealized machine, which they called "p-code", for portable. Today we call it **bytecode**, because each instruction is often a single byte long.
#### Virtual Machine
- Emulates a hypothetical chip supporting the virtual architecture at runtime
- Slower than going fully native, but more convenient and portable; a toy sketch follows
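A toy sketch of bytecode plus a VM dispatch loop. The three-opcode instruction set and encoding are entirely made up for illustration, nothing like Lox's real one:
```
class TinyVM {
    static final byte OP_CONST = 0; // push the next byte as a constant
    static final byte OP_ADD   = 1; // pop two values, push their sum
    static final byte OP_PRINT = 2; // pop a value and print it

    public static void main(String[] args) {
        // Bytecode for: print 1 + 2;
        byte[] code = {OP_CONST, 1, OP_CONST, 2, OP_ADD, OP_PRINT};
        int[] stack = new int[256];
        int sp = 0; // stack pointer
        int ip = 0; // instruction pointer

        // The dispatch loop: fetch an opcode, decide what it means, do it.
        while (ip < code.length) {
            switch (code[ip++]) {
                case OP_CONST -> stack[sp++] = code[ip++];
                case OP_ADD   -> { int b = stack[--sp]; int a = stack[--sp]; stack[sp++] = a + b; }
                case OP_PRINT -> System.out.println(stack[--sp]); // prints 3
            }
        }
    }
}
```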
#### Runtime
- Actually running the code once it has been changed into a form that we can execute
- In a language like Go, each compiled application has its own copy of Go's runtime directly embedded in it.
Shortcuts that simplify or skip some of these steps: single-pass compilers, tree-walk interpreters, transpilers
#### What is the difference between a compiler and an interpreter?
Compiling means translating source code from one language to another; it does not imply executing it. When we say an implementation is an interpreter, we mean it takes in source code and executes it immediately.
# Lox features (the language the book helps you design)
#### Dynamic Typing
- Variables can store values of any type
- A single variable can store values of different types at different times
- If an operation is applied to values of the wrong types, the mismatch is reported at runtime (one way to represent such values is sketched after this list)
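One way a Java-hosted interpreter can support this (and, as far as I understand, roughly the approach the book's Java implementation takes) is to represent every value as a plain `java.lang.Object` and check types only when an operation executes:
```
class DynamicDemo {
    // The same variable slot can hold a number now and a string later;
    // type errors surface only when an operation checks its operands.
    static Object add(Object a, Object b) {
        if (a instanceof Double && b instanceof Double) {
            return (Double) a + (Double) b;
        }
        throw new RuntimeException("Operands must be numbers.");
    }

    public static void main(String[] args) {
        Object value = 1.5;        // a number...
        value = "now a string";    // ...then a string, same variable

        System.out.println(add(1.0, 2.0)); // 3.0
        try {
            add(1.0, value);               // type mismatch...
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // ...reported at runtime
        }
    }
}
```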
#### Automatic memory management
- Two main techniques for managing memory
- Reference counting
- Tracing garbage collection (usually garbage collection or GC)
- Reference counting is easier to implement (but also limiting); a minimal sketch follows
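A minimal reference-counting sketch; the class and its printout are illustrative only (a real runtime would release memory here). Note that two objects referencing each other would keep both counts above zero forever, which is the cycle problem that pushes languages toward tracing GC:
```
class RefCounted {
    private int count = 1; // the creator holds the first reference

    void retain() { count++; }

    void release() {
        // Free the resource the instant the last reference disappears.
        if (--count == 0) System.out.println("freeing resource");
    }

    public static void main(String[] args) {
        RefCounted r = new RefCounted();
        r.retain();  // a second owner appears
        r.release(); // one owner done; count is 1, still alive
        r.release(); // last owner done -> "freeing resource"
    }
}
```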
#### Small note regarding the implementation of the `for...in` loop
- A `for...in` loop requires some sort of dynamic dispatch in the iterator protocol to handle different kinds of sequences. I didn't understand this at first; my current reading is that the loop calls iterator methods through an interface, and which implementation actually runs depends on the sequence's runtime type (see the sketch below).
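Java's own for-each loop, which desugars to the `Iterable`/`Iterator` protocol, is a concrete example of what I think is meant here:
```
import java.util.List;
import java.util.Set;

class ForEachDemo {
    static void printAll(Iterable<String> sequence) {
        // Desugars to sequence.iterator(), hasNext(), next().
        // Which iterator() actually runs is chosen at runtime from the
        // sequence's concrete class: that's the dynamic dispatch.
        for (String item : sequence) {
            System.out.println(item);
        }
    }

    public static void main(String[] args) {
        printAll(List.of("a", "b")); // dispatches to the List's iterator
        printAll(Set.of("c"));       // same loop, the Set's iterator
    }
}
```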
#### Meaning of "first-class" citizen
Real values that you can get a reference to, store in variables, pass around, etc.
#### Example of closures
```
fun returnFunction() {
  var outside = "outside";

  fun inner() {
    print outside;
  }

  return inner;
}

var fn = returnFunction();
fn();
```
`inner()` has to hold on to references to any surrounding variables that it uses so that they stay around even after the outer function has returned. We call functions that do this "closures".
#### Why might any language be object oriented?
For a dynamically typed language, objects are pretty handy. You need a way of defining compound data types to bundle blobs of stuff together.
#### How do you implement objects?
- Two approaches to implementing objects:
- Classes
- Prototypes
- Classes are more common because of C++, Java, C# and more.
- Prototypes originated in the Self language and came into the mainstream picture with JavaScript gaining popularity.
- Core concepts in class based languages:
- Instances
- Store the state for each object and have a reference to the instance's class
- Classes
- Contain the methods and the inheritance chain
- Core concepts in prototype based languages:
- There are only objects, no classes
- Each individual object may contain state and methods
- Objects can directly inherit from each other ("delegate to", in prototypal lingo); a toy sketch of delegation follows this list
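A toy sketch of delegation (my own illustration, not any particular language's implementation): an object is just a map of fields with an optional parent, and failed lookups are forwarded to the parent:
```
import java.util.HashMap;
import java.util.Map;

// A prototype-style object: no classes, just objects that
// delegate failed lookups to a parent object.
class ProtoObject {
    private final Map<String, Object> fields = new HashMap<>();
    private final ProtoObject prototype; // may be null

    ProtoObject(ProtoObject prototype) { this.prototype = prototype; }

    void set(String name, Object value) { fields.put(name, value); }

    Object get(String name) {
        if (fields.containsKey(name)) return fields.get(name);
        if (prototype != null) return prototype.get(name); // delegation
        return null;
    }
}

class ProtoDemo {
    public static void main(String[] args) {
        ProtoObject animal = new ProtoObject(null);
        animal.set("legs", 4);
        ProtoObject dog = new ProtoObject(animal); // dog delegates to animal
        System.out.println(dog.get("legs"));       // 4, found via delegation
    }
}
```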
> In practice the line between class-based and prototype-based languages blurs. JavaScript’s “constructor function” notion pushes you pretty hard towards defining class-like objects. Meanwhile, class-based Ruby is perfectly happy to let you attach methods to individual instances.
#### General philosophy wrt classes
- People naturally seem to prefer a class-based style. Prototypes are simpler but they seem to accomplish that only by pushing the complexity onto the user.
- Idea behind OOP is encapsulating behaviour and state together. To do this, you need fields/attributes.
- Lox, the language built in this book, gets two implementations: first a tree-walk interpreter written in Java, then a bytecode virtual machine written in C
# A Tree-Walk Interpreter
### Error handling
- More terms to look up:
- Interactive debuggers (is it possible for me to implement one with Lox?)
- Static analyzers
Ideally, you should have an abstraction layer, something like an "ErrorReporter" interface that gets passed to the scanner and parser, so that we can swap in different reporting strategies.
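A minimal sketch of what that could look like in Java; the names are my guess at the shape, not the book's code:
```
// The scanner and parser call report() instead of printing directly,
// so the strategy can be swapped: console output, an IDE's squiggly
// underlines, a test harness that collects errors, etc.
interface ErrorReporter {
    void report(int line, String where, String message);
}

class ConsoleReporter implements ErrorReporter {
    @Override
    public void report(int line, String where, String message) {
        System.err.println("[line " + line + "] Error" + where + ": " + message);
    }
}
```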
### Lexemes and Tokens
```
var language = "lox";
```
The lexemes here are `var`, `language`, `=`, `"lox"`, and `;`.
When the lexeme is bundled together with other useful data, the result is a token.
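In Java terms, the bundle can be a small class like this (a sketch in the spirit of the book's Token class, with a deliberately trimmed-down TokenType):
```
// A token bundles the lexeme with metadata the later stages need.
enum TokenType { VAR, IDENTIFIER, EQUAL, STRING, SEMICOLON, EOF }

class Token {
    final TokenType type;   // which kind of lexeme this is
    final String lexeme;    // the raw substring from the source
    final Object literal;   // parsed value, e.g. "lox" for a string token
    final int line;         // where it appeared, for error reporting

    Token(TokenType type, String lexeme, Object literal, int line) {
        this.type = type;
        this.lexeme = lexeme;
        this.literal = literal;
        this.line = line;
    }

    @Override
    public String toString() {
        return type + " " + lexeme + " " + literal;
    }
}

class TokenDemo {
    public static void main(String[] args) {
        // prints: STRING "lox" lox
        System.out.println(new Token(TokenType.STRING, "\"lox\"", "lox", 1));
    }
}
```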
### Regular languages and expressions
The rules that determine how a particular language groups characters into lexemes are called its lexical grammar. We could perform lexical analysis using regular expressions for the language we are about to develop, but we will not do so, since the main aim is to implement these things from scratch.
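For a taste of the regex route we're skipping: each lexeme class can be described by a pattern, and a Lex-style tool would stitch such patterns into a scanner automatically. The patterns below are simplified sketches:
```
import java.util.regex.Pattern;

class LexicalGrammarDemo {
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");
    static final Pattern NUMBER     = Pattern.compile("[0-9]+(\\.[0-9]+)?");
    static final Pattern STRING     = Pattern.compile("\"[^\"]*\"");

    public static void main(String[] args) {
        System.out.println(IDENTIFIER.matcher("language").matches()); // true
        System.out.println(NUMBER.matcher("3.14").matches());         // true
        System.out.println(STRING.matcher("\"lox\"").matches());      // true
    }
}
```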
### Implementing the Scanner
- The author mentions that static imports are considered bad style -- why is this? Must research further. Link: [[Questions|questions page]]
**Maximal Munch**: When two lexical grammar rules can both match a chunk of code that a scanner is looking at, whichever one matches the most characters wins.
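A tiny sketch of how a hand-written scanner applies this rule, in the spirit of the book's `match()` helper (the surrounding class is my own scaffolding):
```
// When the scanner sees '>', it peeks ahead and prefers the longer
// ">=" token over '>' followed by '='.
class MunchDemo {
    private final String source;
    private int current = 0;

    MunchDemo(String source) { this.source = source; }

    // Consume the next character only if it matches; this is how the
    // longer rule "wins" whenever it can.
    private boolean match(char expected) {
        if (current >= source.length() || source.charAt(current) != expected) {
            return false;
        }
        current++;
        return true;
    }

    String scanOne() {
        char c = source.charAt(current++);
        if (c == '>') return match('=') ? ">=" : ">";
        return String.valueOf(c);
    }

    public static void main(String[] args) {
        System.out.println(new MunchDemo(">=").scanOne()); // ">=", not ">"
    }
}
```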
- The lexical grammars of Python and Haskell are not regular. What does this mean? Link: [[Questions|questions page]]
- Space affects how the code is parsed in languages like CoffeeScript, Ruby, and the C preprocessor. How exactly and why? Link: [[Questions|questions page]]
So far, I have turned the raw source code into a higher-level representation: every lexeme in it has been converted to a token. The next step is to create a parser that converts the token stream into yet another representation.
# Representing Code
### Context free grammar