Most of the time when I read and try to understand other people’s code I end up learning something. With this post I want to try a new writing format inspired by this fact: Read open source code and try to explain how it works. Or even better, try to find answers to my own questions deep in the details of source code. I have done this only a couple of times in the past and realized that reading source code is a skill of its own. You need to develop an intuition to find the key parts, because there is a million rabbit holes to get lost in. Today’s project is Babel1, more precisely its parser babylon2.
Recursive descent parsers
My knowledge of parsers is still quite limited, but I recently started looking at the theory behind them. After learning about grammars, lexers and tokens I am currently looking at parsing algorithms. A parsing algorithm should answer the question: How do we check a sequence of tokens against a grammar and construct a corresponding abstract syntax tree (AST).
One such algorithm or type of parser is called recursive descent parser3. This is the first one I learned about and I am curious if babylon implements it. The main advantage of this strategy is that it is easy to reason about. A parser is quite an involved piece of code to write and needs to cover all edge cases allowed by the grammar, so simplicity is a plus in this situation.
// Input code const a = 3
- Is this an
- Is this a
- Is this a function declaration?
- Is this a named function declaration? No.
- Is this an anonymous function declaration? No.
. . .
- Is this a variable declaration? Yes.
- Is this a
- Is this a
- Is this a
- Is this a
In reality there would be more intermediate steps, because a variable declaration for example
allows multiple variables to be declared, i.e.
But I hope you can see the point:
Try any syntactically allowed structure (grammar rule) until one matches4.
And as indicated by the name, this procedure is recursive even in the case when it does not match.
It might be obvious that
const a = 3 is not a function declaration,
but remember that the parser has to work correctly for any syntactically valid source code
which includes some really weird stuff5.
An additional benefit is that since this algorithm does not use any specifics of the grammar,
there are tools that can generate the parser from the grammar without any human interaction.
Of course, the downside is performance.
The key point of the algorithm is that it tries to parse input according to a rule and has to roll back if it does not match. It does not decide what rule to use and therefore it can happen that parts of the input are processed multiple times.
Is Babylon a recursive descent parser?
What follows is a polished version of my notes from digging through babylon’s source code.
The goal is to find out if the parser implements a recursive descent strategy or something else.
Everytime I reference a class or method name for the first time,
I link it to open the corresponding source code in a new tab.
All of these links target the
v7.0.0-beta.38 version of babel,
so they will make sense even after future releases change the code.
Babylon exposes a
that takes program text as input and a
File object as output.
This type has a field called
File.program where the root of the abstract syntax tree (AST) is stored.
Actually I just went to astexplorer.net and saw that
File is the root node of the AST there.
None the less, we are on the right track to find the core parsing algorithm.
parse function contains some code to handle options and plugins.
As far as I know, Babel has no public API for parser plugins, but it looks like something similar is used internally.
I will ignore all of this though and dig deeper into the
Parser class which the input and options are passed to.
Here we can see the
Parser class interacting with the given input for the first time.
If the source file contains a leading hashbang line like
#!/usr/bin/env node, this code is responsible for skipping it.
It looks like
this.state.pos points to the part of the input that still needs to be parsed.
Since the hashbang needs to be the first line of a file, the check here is
this.state.pos === 0.
A successful check will call
this.skipLineComment, which moves the position pointer on to the second line I assume.
The hashbang skip is part of the constructor and not the main
This process seems to be kicked off with
which are both inherited from the parent class
Now we are deep into the code: 1697 lines of source code and most methods names contain the word
The header comment for
parseTopLevel explains that a program is a sequence of statements and this method is responsible for parsing that.
Following the method calls
parseBlockOrModuleBlockBody we find the first lines of code
translating input text into AST nodes.
Program.body was passed all the way down here where we now insert statements
This array represents the sequence of statements that make up the program at the top level.
Here comes the point where we can check if babylon uses a recursive descent parsing strategy. A statement can be many things, so we need to decide what grammar rule to try now. As explained above, in a recursive descent parser we would simply try one after another and roll back whenever a grammar rule does not work for the given input.
But that is not what’s happening here. As explained in a source code comment,
it is possible to recognize most statement types by the next few tokens.
And indeed, the switch statement in this method determines the type of statement by the type of the current token
In some cases it also checks the next token’s type with the
How many tokens does it need to check at most to determine the next grammar rule?
I looked at the source of
lookahead and found that it only returns one token after the current one.
There are other means to access more tokens though, so I am not sure how to find that number.
I guess the exact number also depends on the granularity of tokens, but I am not sure about that.
But I am pretty confident in the fact that babylon does not implement a recursive descent parsing algorithm,
because it makes decisions about what type of grammar rules to apply instead of trying them all.
It is still top down however, we start of by calling parse methods for
Statement and more and more specific grammar rules after that.
There is probably a name for this parsing algorithm which I will hopefully learn soon.
Conclusion and further questions
- What parsing algorithm is this? Is it possible to generalize it or is it specific to one grammar?
- How many tokens does the parser need to see in order to determine the applicable grammar rule?
I will have to catch up with theory to answer these, but I hope a future post will do so. I discovered another thing when reading the source code: The process of lexing (tokenizing) and parsing seem to be coupled in babylon. All theory I looked at presented it as two separate processes (functions):
lexer(text) -> [...tokens...]
parser([...tokens...]) -> ast
From what I can tell, babylon calls
nextToken while parsing so both things are happening at the same time.
So a third question I have is:
- Can babylon do lexing without parsing?
If you know the answers, let me know! I might do a follow-up post to address them.
The order in which we try the rules is important and needs to be defined by the grammar. ↩
I encountered some fun edge cases in Babel pull requests, for example https://github.com/babel/babel/pull/6855#discussion_r152961191. ↩