In the last OSS deep dive post I looked at babylon’s source code and learned that it does not do recursive descent parsing but follows some other strategy. At the end of that post I was left with a few questions, which I was able to answer in the meantime. It turns out that the first two questions were answered in a lecture I watched the day I wrote the first post. Here are some of my notes from the lecture and a closer look at tokens in babylon.

What parsing algorithm is this? Is it possible to generalize it or is it specific to one grammar?

The parsing strategy of babylon is called predictive parsing. Given the current token, the parser is able to predict the correct grammar rule to apply. This is exactly what happens in the switch statement over the token type we saw last time. However, not all context-free grammars allow this strategy. The set of grammars where the next $k$ tokens uniquely determine the production rule is called $LL(k)$. Interestingly enough, this is a property of the grammar, not the language. Sometimes a grammar can be “fixed” (for example by removing left recursion) so that it is in $LL(1)$ and can be parsed efficiently with a lookahead of $1$. The details still escape my understanding at this point though.
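To make the idea concrete, here is a toy predictive parser, my own sketch rather than babylon’s code, for a small made-up grammar that happens to be $LL(1)$: the current token alone picks the production, just like the switch statement over token types.

```javascript
// Toy predictive (LL(1)) parser -- an illustrative sketch, not babylon's code.
//
//   Stmt       → ConstDecl | ReturnStmt
//   ConstDecl  → 'const' name '=' num ';'
//   ReturnStmt → 'return' num ';'
//
// Because ConstDecl and ReturnStmt start with different tokens,
// looking at the current token is enough to predict the production.
function parseStmt(tokens) {
  let pos = 0
  const next = () => tokens[pos++]
  const expect = type => {
    const tok = next()
    if (tok.type !== type) throw new SyntaxError(`expected ${type}, got ${tok.type}`)
    return tok
  }

  switch (tokens[pos].type) {   // the switch over the token type
    case 'const': {             // predicts ConstDecl
      expect('const')
      const name = expect('name')
      expect('=')
      const init = expect('num')
      expect(';')
      return { type: 'ConstDecl', name: name.value, init: init.value }
    }
    case 'return': {            // predicts ReturnStmt
      expect('return')
      const arg = expect('num')
      expect(';')
      return { type: 'ReturnStmt', arg: arg.value }
    }
    default:
      throw new SyntaxError(`unexpected token ${tokens[pos].type}`)
  }
}

const ast = parseStmt([
  { type: 'const' },
  { type: 'name', value: 'x' },
  { type: '=' },
  { type: 'num', value: 3 },
  { type: ';' },
])
console.log(ast) // { type: 'ConstDecl', name: 'x', init: 3 }
```

If the grammar had two productions starting with the same token, this scheme would break down with a lookahead of one, which is exactly the kind of situation that forces a grammar rewrite or a larger $k$.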

How many tokens does the parser need to see in order to determine the applicable grammar rule?

Since the lookahead method returns one more token after the current one, I believe this number to be $k=1$ in babylon. For now I am more interested in the last question.
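A lookahead of one can be pictured as a thin wrapper over the token stream. This is only an illustrative sketch of the concept, not babylon’s implementation (as far as I can tell, babylon peeks ahead by cloning and restoring its internal parser state rather than buffering an array of tokens):

```javascript
// Minimal token stream with a lookahead of one (k = 1).
// Illustrative sketch only -- not babylon's actual code.
class TokenStream {
  constructor(tokens) {
    this.tokens = tokens
    this.pos = 0
  }
  current()   { return this.tokens[this.pos] }      // token being parsed
  lookahead() { return this.tokens[this.pos + 1] }  // peek without consuming
  next()      { return this.tokens[this.pos++] }    // consume and advance
}

const stream = new TokenStream(['const', 'x', '=', '3', ';'])
console.log(stream.current())   // 'const'
console.log(stream.lookahead()) // 'x'
stream.next()
console.log(stream.current())   // 'x'
```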

Can babylon do lexing without parsing?

Lexing (or tokenizing) is the process of aggregating the smallest meaningful units of text in a program. This is similar to dividing raw text (a sequence of characters) into a sequence of words and punctuation. The code I looked through in the previous article already operated on a higher level of abstraction: it deals with tokens, not raw text. But I found out that you can ask babylon to give you not only the AST but also the sequence of tokens. To do this, simply pass the option tokens: true to a babylon.parse call.

Here is an example:

const { parse } = require('babylon')

const input = 'const x = 3;'

const output = parse(input, { tokens: true })

console.log(output.tokens) // Array of Token objects


This way we get the raw result of the lexing process. For example, the first token in the output above is output.tokens[0]:

Token {
  type: KeywordTokenType {
    label: 'const',
    keyword: 'const',
    beforeExpr: false,
    startsExpr: false,
    rightAssociative: false,
    isLoop: false,
    isAssign: false,
    prefix: false,
    postfix: false,
    binop: null,
    updateContext: null
  },
  value: 'const',
  start: 0,
  end: 5,
  loc: SourceLocation {
    start: Position {
      line: 1,
      column: 0
    },
    end: Position {
      line: 1,
      column: 5
    }
  }
}


So yes, we can get the sequence of tokens from babylon. But it seems to be coupled with parsing and AST construction in this case.
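For contrast, a standalone lexer that is not coupled to any parser can be as small as a loop over a few regular expressions. The sketch below is my own code, not babylon’s; the token kinds are simplified, but the start/end offsets mirror the shape of babylon’s tokens for the same input:

```javascript
// A tiny standalone lexer -- my own sketch, not babylon's code -- showing
// that tokenizing on its own needs no parser. Each rule pairs a regex
// (anchored to the start of the remaining input) with a token type;
// a null type means "skip" (whitespace).
const rules = [
  [/^\s+/,          null],
  [/^const\b/,      'const'],
  [/^[A-Za-z_]\w*/, 'name'],
  [/^\d+/,          'num'],
  [/^=/,            'eq'],
  [/^;/,            'semi'],
]

function tokenize(input) {
  const tokens = []
  let pos = 0
  while (pos < input.length) {
    const rest = input.slice(pos)
    const rule = rules.find(([re]) => re.test(rest))
    if (!rule) throw new SyntaxError(`unexpected character at ${pos}`)
    const [re, type] = rule
    const text = rest.match(re)[0]
    if (type) tokens.push({ type, value: text, start: pos, end: pos + text.length })
    pos += text.length
  }
  return tokens
}

const toks = tokenize('const x = 3;')
console.log(toks.map(t => t.type)) // [ 'const', 'name', 'eq', 'num', 'semi' ]
console.log(toks[0])               // { type: 'const', value: 'const', start: 0, end: 5 }
```

The first token comes out with start: 0 and end: 5, just like the KeywordTokenType token babylon produced above, which shows that the lexing step itself is conceptually independent of parsing even if babylon’s API exposes both together.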