I think Open Source is an incredible resource for learning. There is a bazillion lines of code on GitHub doing all kinds of things. Investing time into reading some of it has proven to be extremely valuable to me. Currently I am focusing on Babel, the JavaScript transpiler 1. Today I spent some time looking into babel-generator 2 which is responsible for turning an abstract syntax tree (AST) into JavaScript source code. This is the last step that Babel runs after all code transformations have been applied. Going through this process should not be terribly difficult since the abstract syntax tree contains all the information about the program, right? When I first learned about ASTs I spent some time on astexplorer.net3 observing this transformation. It lets you write JavaScript code and see the corresponding AST right next to it. After going through some examples I realized that printing small things like individual expressions is quite easy. All you have to know is how the node type encodes the information about the aspects of the syntactic element and how to express it in JavaScript syntax. But reading into the source code of the generator I found some very interesting aspects of the code generation process. For this post I picked one particular example that I will go through in detail now.

Naively printing an expression based on its AST representation

The input that is fed into babel-generator is an AST. Consider the following AST:

{
	type: "BinaryExpression",
	operator: "+",
	left: {
		type: "Identifier",
		name: "a"
	},
	right: {
		type: "UpdateExpression",
		operator: "++",
		prefix: true,
		argument: {
			type: "Identifier",
			name: "b"
		}
	}
}

Now I want to figure out the source code for this. I guess my first naive approach would be to walk through the graph depth-first and print nodes as I go. Lets see what we get:

  1. BinaryExpression: Print left, operator and then right

  2. Identifier: Print name "a"

  3. Print operator "+"

  4. UpdateExpression: prefix is true, therefore print operator and then argument

  5. Print operator "++"

  6. Identifier: Print name "b"

Putting together all the print statements yields the source code a+++b. Is this correct? It certainly is syntactically correct, but does it preserve the semantics of the AST?

Inserting additional constructs might be required

What happens is that a+++b is evaluated as (a++)+b which is not the expression the AST above describes and really produces a different result and side effects. Sometimes whitespace, parentheses and semicolons might be necessary in order to preserve correct semantics. This particular example can be solved by inserting a space between the pluses:

a+ ++b // Correct, evaluates as a+(++b)
a+++b // Incorrect, evaluates as (a++)+b
a++ +b // Incorrect, same as a+++b

Babels algorithmic solution for this is simple: Whenever the current code ends with a plus and something starting with a plus should be appended, it inserts a space between the two pluses. But this is a really specific rule to one kind of problem and there are many other cases where whitespace, parentheses or semicolons are necessary.

Abstract syntax trees contains a lot of implicit information

From the example presented here I learned that an AST contains a lot of implicit information that has to be taken into account when converting it into source code. When is a semicolon necessary? When is whitespace necessary? There are obvious cases like the space character after the const keyword, but there are more nuanced cases like a+++b. They might be syntactically correct with or without whitespace in certain places but mean different things. And a source code generator that works with an AST has to respect all these rules. In conclusion: The work performed by babel-generator is way more difficult than I expected it to be!

Having thought about this for a while now I think the fundamental problem is the following: an AST is a much richer structure than a string or sequence of tokens. The string representation of the program needs additional helper constructs that group and order different parts, while the tree structure encodes this in its relations between nodes. From this perspective, an AST seems like a much more natural although less flexible representation of a program.