What exactly is a valid URL?

It’s a simple question with a simple answer, but it makes a great example for exploring extended Backus-Naur form (EBNF). EBNF is a formal notation for describing a set of strings by capturing the pattern they share. In this case, the pattern is the structure of a URL. The notation expresses that structure as formal rules, and for any given string these rules decide whether it is a URL or not. An EBNF description is a theoretical object, not a program. The natural second step, of course, is to write a program that performs the check.

This post is an attempt at growing and formally describing my understanding of a URL. My EBNF is incomplete in some ways, but I hope not incorrect: it might not accept every string that is a valid URL, but it should not accept any string that is not one. It is restricted to website URLs and misses some edge cases. I did not want to look into character encoding or similar topics and get lost in details. From what I can tell, the range of valid URLs also depends to a certain extent on the server setup. What follows is a formal description of my current understanding of a website URL:

lowercase_letter = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
uppercase_letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
letter = lowercase_letter | uppercase_letter
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
alphanumeric = letter | digit
lowercase_alphanumeric = lowercase_letter | digit

hyphen = "-" | "_"

lowercase_word = lowercase_alphanumeric { lowercase_alphanumeric }
word = alphanumeric { alphanumeric }
word_with_hyphens = word { hyphen word }
number = digit { digit }
key_value_pair = word_with_hyphens "=" word_with_hyphens
domain_delimiter = "." | "-"

protocol = "http" | "https"
domain = lowercase_word { domain_delimiter lowercase_word }
port = ":" number
path = "/" { word_with_hyphens "/" }
parameters = "?" key_value_pair { "&" key_value_pair }
anchor = "#" word_with_hyphens

url = protocol "://" domain [ port ] path [ parameters ] [ anchor ]
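
To see the rules in action, here is a minimal sketch that translates the grammar above into a Python regular expression. The constant names mirror the grammar rules, but they and the helper `is_url` are my own choices for illustration. It assumes a regular expression is enough, which holds here because none of the rules is recursive.

```python
import re

# Each constant mirrors one rule of the grammar above.
WORD = r"[A-Za-z0-9]+"                           # word
LOWERCASE_WORD = r"[a-z0-9]+"                    # lowercase_word
WORD_WITH_HYPHENS = rf"{WORD}(?:[-_]{WORD})*"    # word { hyphen word }
NUMBER = r"[0-9]+"                               # number
KEY_VALUE_PAIR = rf"{WORD_WITH_HYPHENS}={WORD_WITH_HYPHENS}"

PROTOCOL = r"https?"                             # "http" | "https"
DOMAIN = rf"{LOWERCASE_WORD}(?:[.-]{LOWERCASE_WORD})*"
PORT = rf":{NUMBER}"
PATH = rf"/(?:{WORD_WITH_HYPHENS}/)*"            # always ends with a slash
PARAMETERS = rf"\?{KEY_VALUE_PAIR}(?:&{KEY_VALUE_PAIR})*"
ANCHOR = rf"#{WORD_WITH_HYPHENS}"

# url = protocol "://" domain [ port ] path [ parameters ] [ anchor ]
URL = re.compile(
    rf"{PROTOCOL}://{DOMAIN}(?:{PORT})?{PATH}(?:{PARAMETERS})?(?:{ANCHOR})?"
)


def is_url(candidate: str) -> bool:
    """Return True if the whole string matches the grammar's url rule."""
    return URL.fullmatch(candidate) is not None


print(is_url("https://example.com/"))      # True
print(is_url("https://example.com/blog"))  # False: the path rule needs a trailing slash
print(is_url("ftp://example.com/"))        # False: only http and https are allowed
```

The second call shows where the grammar is deliberately stricter than real-world URLs: a path without a trailing slash is rejected, even though browsers and servers accept it.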

Reading up on this stuff was not only tedious

Okay, I admit, most of it was tedious. But I also stumbled upon a few interesting facts that are worth highlighting:

  • The browser hides details like trailing slashes or protocols from the user.

  • The anchor of a URL is not sent to the server; it is only used by the client.

  • The query parameters must come before the anchor.

  • The default ports for HTTP and HTTPS are 80 and 443, respectively.

  • Depending on the server architecture, the path can be case-sensitive, meaning you could technically use uppercase letters. But it is not advised for several reasons, and I see no benefit in doing it.
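
Some of these facts can be checked with Python's standard library. The sketch below uses `urllib.parse.urlsplit` to take a URL apart. The helper `describe` and the `DEFAULT_PORTS` table are hypothetical names made up for this example, and "sent_to_server" is a simplification that only covers the path and query, not headers such as the host.

```python
from urllib.parse import urlsplit

# Default ports from the list above (assumed table, for illustration only).
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe(url: str) -> dict:
    """Split a URL and report which parts matter to the server and the client."""
    parts = urlsplit(url)
    # Path plus query is what ends up in the request; the anchor stays behind.
    request_target = parts.path + ("?" + parts.query if parts.query else "")
    return {
        "sent_to_server": request_target,
        "client_only_anchor": parts.fragment,
        # Fall back to the protocol's default when no port is written out.
        "effective_port": parts.port or DEFAULT_PORTS.get(parts.scheme),
    }


print(describe("https://example.com/blog/?id=42#top"))

# With the order swapped, everything after "#" becomes part of the anchor
# and the query stays empty.
swapped = urlsplit("https://example.com/#top?id=42")
print(repr(swapped.query), repr(swapped.fragment))
```

The swapped example shows why the order matters: a "?" that appears after the "#" is no longer read as the start of the query parameters.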