Most of the time when I read and try to understand other people’s code I end up learning something. With this post I want to try a new writing format inspired by this fact: Read open source code and try to explain how it works. Or even better, try to find answers to my own questions deep in the details of source code. I have done this only a couple of times in the past and realized that reading source code is a skill of its own. You need to develop an intuition to find the key parts, because there is a million rabbit holes to get lost in. Today’s project is Babel1, more precisely its parser babylon2.

## Recursive descent parsers

My knowledge of parsers is still quite limited, but I recently started looking at the theory behind them. After learning about grammars, lexers and tokens I am currently looking at parsing algorithms. A parsing algorithm should answer the question: How do we check a sequence of tokens against a grammar and construct a corresponding abstract syntax tree (AST).

One such algorithm or type of parser is called recursive descent parser3. This is the first one I learned about and I am curious if babylon implements it. The main advantage of this strategy is that it is easy to reason about. A parser is quite an involved piece of code to write and needs to cover all edge cases allowed by the grammar, so simplicity is a plus in this situation.

The algorithm works in a top-down fashion. It starts with the full task (parse a program) and splits it into smaller subtasks (parse a variable declaration). The difficulty lies in the fact that this division step is not deterministic. In JavaScript, a program can consist of any number of statements and each statement can be of any kind. So how does a recursive descent parser split the large tasks into smaller ones? It simply tries all possibilities, one after another and uses the first one that works. To try a possibility means to check if the input code matches a grammar rule. So we could parse a JavaScript statement like this:

// Input code
const a = 3

• Is this an if-statement? No.
• Is this a for-loop? No.
• Is this a function declaration?
• Is this a named function declaration? No.
• Is this an anonymous function declaration? No.

–> No.

. . .

• Is this a variable declaration? Yes.
• Is this a var declaration? No.
• Is this a let declaration? No.
• Is this a const declaration? Yes.

In reality there would be more intermediate steps, because a variable declaration for example allows multiple variables to be declared, i.e. const a,b;. But I hope you can see the point: Try any syntactically allowed structure (grammar rule) until one matches4. And as indicated by the name, this procedure is recursive even in the case when it does not match. It might be obvious that const a = 3 is not a function declaration, but remember that the parser has to work correctly for any syntactically valid source code which includes some really weird stuff5. An additional benefit is that since this algorithm does not use any specifics of the grammar, there are tools that can generate the parser from the grammar without any human interaction. Of course, the downside is performance.

The key point of the algorithm is that it tries to parse input according to a rule and has to roll back if it does not match. It does not decide what rule to use and therefore it can happen that parts of the input are processed multiple times.

## Is Babylon a recursive descent parser?

What follows is a polished version of my notes from digging through babylon’s source code. The goal is to find out if the parser implements a recursive descent strategy or something else. Everytime I reference a class or method name for the first time, I link it to open the corresponding source code in a new tab. All of these links target the v7.0.0-beta.38 version of babel, so they will make sense even after future releases change the code.

Babylon exposes a parse method that takes program text as input and a File object as output. This type has a field called File.program where the root of the abstract syntax tree (AST) is stored. Actually I just went to astexplorer.net and saw that File is the root node of the AST there. None the less, we are on the right track to find the core parsing algorithm. The parse function contains some code to handle options and plugins. As far as I know, Babel has no public API for parser plugins, but it looks like something similar is used internally. I will ignore all of this though and dig deeper into the Parser class which the input and options are passed to.

Here we can see the Parser class interacting with the given input for the first time. If the source file contains a leading hashbang line like #!/usr/bin/env node, this code is responsible for skipping it. It looks like this.state.pos points to the part of the input that still needs to be parsed. Since the hashbang needs to be the first line of a file, the check here is this.state.pos === 0. A successful check will call this.skipLineComment, which moves the position pointer on to the second line I assume.

The hashbang skip is part of the constructor and not the main Parser.parse method. Maybe this is because the hashbang is not valid JavaScript syntax and it is easier to deal with this exception separately. In any case, we are now at the first JavaScript character in the input and want to start parsing it. This process seems to be kicked off with this.nextToken and this.parseTopLevel which are both inherited from the parent class StatementParser.

Now we are deep into the code: 1697 lines of source code and most methods names contain the word parse. The header comment for parseTopLevel explains that a program is a sequence of statements and this method is responsible for parsing that. Following the method calls parseBlockBody and parseBlockOrModuleBlockBody we find the first lines of code translating input text into AST nodes. The field Program.body was passed all the way down here where we now insert statements constructed by parseStatement and parseStatementContent. This array represents the sequence of statements that make up the program at the top level.

Here comes the point where we can check if babylon uses a recursive descent parsing strategy. A statement can be many things, so we need to decide what grammar rule to try now. As explained above, in a recursive descent parser we would simply try one after another and roll back whenever a grammar rule does not work for the given input.

But that is not what’s happening here. As explained in a source code comment, it is possible to recognize most statement types by the next few tokens. And indeed, the switch statement in this method determines the type of statement by the type of the current token stored in this.state.type.

In some cases it also checks the next token’s type with the lookaheadmethod. How many tokens does it need to check at most to determine the next grammar rule? I looked at the source of lookahead and found that it only returns one token after the current one. There are other means to access more tokens though, so I am not sure how to find that number. I guess the exact number also depends on the granularity of tokens, but I am not sure about that.

But I am pretty confident in the fact that babylon does not implement a recursive descent parsing algorithm, because it makes decisions about what type of grammar rules to apply instead of trying them all. It is still top down however, we start of by calling parse methods for Program, then Statement and more and more specific grammar rules after that. There is probably a name for this parsing algorithm which I will hopefully learn soon.

## Conclusion and further questions

Babylon is not a recursive descent parser. The JavaScript grammar seems simple enough to determine the grammar rules by only looking at the next few tokens. This naturally leads me to the following questions:

• What parsing algorithm is this? Is it possible to generalize it or is it specific to one grammar?
• How many tokens does the parser need to see in order to determine the applicable grammar rule?

I will have to catch up with theory to answer these, but I hope a future post will do so. I discovered another thing when reading the source code: The process of lexing (tokenizing) and parsing seem to be coupled in babylon. All theory I looked at presented it as two separate processes (functions):

1. lexer(text) -> [...tokens...]
2. parser([...tokens...]) -> ast

From what I can tell, babylon calls nextToken while parsing so both things are happening at the same time. So a third question I have is:

• Can babylon do lexing without parsing?

If you know the answers, let me know! I might do a follow-up post to address them.

1. The order in which we try the rules is important and needs to be defined by the grammar.

2. I encountered some fun edge cases in Babel pull requests, for example https://github.com/babel/babel/pull/6855#discussion_r152961191