This week I spent a few hours working on webassemblyjs [1], trying to improve our WebAssembly text format parser. Up to now I have mostly contributed code for dealing with number literals in that package, and I currently aim to get this part into good shape. One step in this process is called lexing or tokenizing, which essentially decides whether a string is a number or not. This is not as easy as it sounds, because WebAssembly’s text format supports various formats for number literals (hexadecimal floating point, for example). The cool thing about this project is that the expected behavior is pretty well defined, namely by the WebAssembly specification. To make my goal of “improve the parser” more measurable, I decided to add tests involving as many valid number literals as possible.

I had this on my mind for some time, but feared that it would be either too much manual work or too difficult to automate. Fortunately, it was quite easy to automatically extract number literal strings from the official spec tests [2] and convert them to tokenizer tests for our project. Here is how I did it:

Extract number literals from spec tests

#!/usr/bin/env bash

# integer literals
curl \
	| head -n70 \
	| grep -oP "(i(32|64).const\K\s*((\d|-|x|_|\+|[a-zA-Z])*))" \
	> packages/wast-parser/test/tokenizer/raw/int_literals.txt

# float literals
curl \
	| head -n181 \
	| grep -oP "(f(32|64).const\K\s*((\d|-|x|_|\+|\.|[a-zA-Z]|:)*))" \
	> packages/wast-parser/test/tokenizer/raw/float_literals.txt


I wrote this script, which creates the new files int_literals.txt and float_literals.txt containing a lot of number literals. It uses curl to download test files containing WebAssembly code with lots of number literals. Then head cuts off everything after line 70 (or line 181 for the float file), because after that only malformed number literals appear in the file (luckily they grouped it this way). For the grep call I had to do quite some googling, but I got it working in the end. It basically checks for lines containing i32.const X, i64.const X, f32.const X or f64.const X and extracts the X part. Note that I am not actually checking whether X is a number literal; I only know that it is from looking at the test file myself. The regular expressions are therefore way too generous and accept a lot more than number literals, but that is okay.
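
To make the pattern more concrete, here is a minimal sketch of the integer extraction in JavaScript. It is not part of the original script. JavaScript regular expressions have no \K, so a capture group isolates the literal instead, and the dot before const is escaped. The helper name extractIntLiterals and the sample lines are made up for illustration and are not taken from the spec tests.

```javascript
// Sketch only: a JavaScript counterpart of the integer grep pattern.
// The capture group stands in for grep's \K; \s* sits outside the group,
// so the captured literal needs no trimming.
const INT_CONST = /i(?:32|64)\.const\s*((?:\d|-|x|_|\+|[a-zA-Z])*)/g

// Hypothetical helper, not part of webassemblyjs.
function extractIntLiterals(source) {
  return Array.from(source.matchAll(INT_CONST), match => match[1])
    .filter(literal => literal.length > 0)
}

// Made-up sample lines in the style of a .wast test file.
const sample = [
  '(func (export "i32.test") (result i32) (return (i32.const 0x0bAdD00D)))',
  '(func (export "i64.dec_sep") (result i64) (i64.const 1_000_000))',
  '(func (export "i32.neg") (result i32) (i32.const -0x1))',
].join('\n')

console.log(extractIntLiterals(sample))
```

For this sample the function returns the three literals 0x0bAdD00D, 1_000_000 and -0x1. It also shows how generous the pattern is: an input such as (i32.const foo) yields foo, which is not a number literal at all.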

Create new testcases from number literals

Okay, so now I have two files containing one valid number literal per line. The tokenizer tests in webassemblyjs work with an actual.wast and an expected.json file. The first one contains WebAssembly code and the second one a list of tokens in JSON format. Since this is just the tokenizer, there is no need to worry about creating a proper WebAssembly module, so the actual.wast file can contain just the number literal. That means the expected output should be a list of exactly one number token for this literal. At first I tried to do this with awk, since it seemed appropriate for performing a repetitive task for each line. I almost got it working, but my awk skills were too limited and the process too frustrating. So I used node instead, which made reading and writing files a bit more verbose, but dealing with JSON a lot easier. Here is what I ended up with:

#!/usr/bin/env node

const fs = require('fs')
const path = require('path')

const packageDir = './packages/wast-parser/test/tokenizer/'

// Read the extracted literals: one per line, surrounding whitespace removed
const allIntegers = fs.readFileSync(path.join(packageDir, 'raw/int_literals.txt'), 'utf-8')
  .split('\n')
  .map(s => s.trim())
  .filter(s => s.length > 0)

const allFloats = fs.readFileSync(path.join(packageDir, 'raw/float_literals.txt'), 'utf-8')
  .split('\n')
  .map(s => s.trim())
  .filter(s => s.length > 0)

const all = [ ...allIntegers, ...allFloats ]

// Expected tokenizer output: exactly one number token starting at line 1, column 1
const expected = literal => JSON.stringify([{
  "type": "number",
  "value": literal,
  "loc": {
    "start": {
      "line": 1,
      "column": 1
    }
  }
}], null, 2)

all.forEach(literal => {
  // One test directory per literal
  const dir = path.join(packageDir, `number-literals/${literal}/`)

  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true })
  }

  fs.writeFileSync(path.join(dir, 'actual.wast'), literal)
  fs.writeFileSync(path.join(dir, 'expected.json'), expected(literal))
})


This was a fun experience in automating a task with some scripts! Except for the grep part, the code was easy to write and definitely a lot more efficient than manually copying and pasting. Of course I could have done the extraction (and even the downloading) from the node script too, but for this part I prefer the grep solution.