Mauro Bringolf

Currently geeking out as a WordPress developer at WebKinder and a student of computer science at ETH.

Defining my current understanding of a URL as EBNF

January 15, 2017

What exactly is a valid URL?

It’s a simple question with a simple answer, but it makes a great example for exploring the extended Backus-Naur form (EBNF). EBNF is a formal notation for describing a set of strings by capturing their common pattern. In this case, the pattern is that of a URL. The description consists of formal rules, and for any given string these rules decide whether it is a URL or not. An EBNF description is a theoretical object, not a program. But of course the natural second step is to write a program that performs this check.
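To make the step from rules to program concrete, here is a minimal Python sketch for a made-up toy rule, `greeting = "hi" { "!" }`. The rule and the function name `is_greeting` are assumptions for illustration only and have nothing to do with URLs. In EBNF, curly braces mean "repeat zero or more times", which in a program typically becomes a loop.

```python
def is_greeting(text: str) -> bool:
    """Decide whether text matches the toy rule: greeting = "hi" { "!" }"""
    # The literal "hi" must come first.
    if not text.startswith("hi"):
        return False
    rest = text[2:]
    # { "!" } allows zero or more exclamation marks and nothing else.
    return all(ch == "!" for ch in rest)
```

Calling `is_greeting("hi!!")` answers yes, while `is_greeting("hi?")` answers no, which is exactly the kind of yes-or-no decision an EBNF description encodes.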

This post is an attempt at growing and formally describing my understanding of a URL. My EBNF is incomplete in some ways, but I hope not incorrect. What I mean is that it might not capture every string that is a valid URL, but it should not capture any string that is not one. It is restricted to website URLs and misses some edge cases. I did not want to look into character encoding or similar topics and get lost in details. From what I can tell, the range of valid URLs also depends on the server setup to a certain extent. What follows is a formal description of my current understanding of a website URL:

lowercase_letter = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
uppercase_letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
letter = lowercase_letter | uppercase_letter
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
alphanumeric = letter | digit
lowercase_alphanumeric = lowercase_letter | digit

hyphen = "-" | "_"

lowercase_word = lowercase_alphanumeric { lowercase_alphanumeric }
word = alphanumeric { alphanumeric }
word_with_hyphens = word { hyphen word }
number = digit { digit }
key_value_pair = word_with_hyphens "=" word_with_hyphens
domain_delimiter = "." | "-"

protocol = "http" | "https"
domain = lowercase_word { domain_delimiter lowercase_word }
port = ":" number
path = "/" { word_with_hyphens "/" }
parameters = "?" key_value_pair { "&" key_value_pair }
anchor = "#" word_with_hyphens

url = protocol "://" domain [ port ] path [ parameters ] [ anchor ]
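
As a rough sketch of the "second step", the grammar above can be translated rule by rule into a regular expression. This works because none of the rules refer to themselves, so no real parser is needed. The Python below is only one possible translation, not a reference implementation; the constant names and the function `is_url` are my own assumptions, chosen to mirror the rule names.

```python
import re

# Terminal rules become character classes (ASCII only, as in the grammar).
LOWER_ALNUM = r"[a-z0-9]"   # lowercase_alphanumeric
ALNUM = r"[A-Za-z0-9]"      # alphanumeric

# word = alphanumeric { alphanumeric }
WORD = rf"{ALNUM}+"
# word_with_hyphens = word { hyphen word }, with hyphen = "-" | "_"
WORD_WITH_HYPHENS = rf"{WORD}(?:[-_]{WORD})*"
# key_value_pair = word_with_hyphens "=" word_with_hyphens
KEY_VALUE_PAIR = rf"{WORD_WITH_HYPHENS}={WORD_WITH_HYPHENS}"

# protocol = "http" | "https"
PROTOCOL = r"https?"
# domain = lowercase_word { domain_delimiter lowercase_word }
DOMAIN = rf"{LOWER_ALNUM}+(?:[.-]{LOWER_ALNUM}+)*"
# port = ":" number
PORT = r":[0-9]+"
# path = "/" { word_with_hyphens "/" }
PATH = rf"/(?:{WORD_WITH_HYPHENS}/)*"
# parameters = "?" key_value_pair { "&" key_value_pair }
PARAMETERS = rf"\?{KEY_VALUE_PAIR}(?:&{KEY_VALUE_PAIR})*"
# anchor = "#" word_with_hyphens
ANCHOR = rf"#{WORD_WITH_HYPHENS}"

# url = protocol "://" domain [ port ] path [ parameters ] [ anchor ]
URL = re.compile(
    rf"{PROTOCOL}://{DOMAIN}(?:{PORT})?{PATH}(?:{PARAMETERS})?(?:{ANCHOR})?"
)


def is_url(candidate: str) -> bool:
    """Return True if the whole string is derivable from the url rule."""
    return URL.fullmatch(candidate) is not None
```

With this translation, `is_url("https://example.com/")` holds, while `is_url("https://example.com")` and `is_url("https://example.com/about")` do not, because the `path` rule requires a leading slash and a trailing slash after every segment. That is one of the places where the grammar is deliberately incomplete rather than incorrect.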

Reading this stuff up was not only tedious

Okay, I admit, most of it was tedious. But I also stumbled upon a few interesting facts that are worth highlighting: