Tokenization
This article follows a guide for Compiler Construction. If this is your first article here, it might help to start at the Compiler Construction Guide.
To guide, a tokenizer for our derivative-based construction will look much like our NFA’s construction’s tokenizer. When we approached this, we reminded ourselves of what we were trying to accomplish with Figure 1. It may help just to refresh yourself with this figure again.

Finding a Lexeme
With our NFA construction, we needed to consider states and the location of the current parse. With the current implementation we are building, we do not worry about the state. But that does mean we need to find another manner to look through the source program.
The idea is pretty close still. We will loop through the file one string at a time and look for the largest matching regular expression. We will use the same records list as before; this time, we will have boolean values for each of the regular expressions instead of a set of states. All of these are false; then we know we have passed the largest matching string and can look back to get it.
def find-lexeme(regular-expressions, source-program, initial-string-index):
current-string-index = initial-string-index + 1
records = []
while source_program.length > current_string_index:
string = source_program[initial-string-index: current-string-index]
re-matches = [match(re, string) for re in regular-expressions]
records += [[initial-string-index, current-string-index, re-matches]]
if True not in re-matches:
if records.length == 1:
break
record = records[-2]
index = record[1]
re_index = record[2].index(True)
return index, re-index
current-string-index += 1
return False, False
Building the Lexer
We will take the same approach as before, with an object represented as an array. This array having 2 functions within it, has-next-token and next-token. To do this, though, that means we once again need our helper function of next-token, which we will then wrap in a lambda function around it.
def next-token(source-program, regular-expressions, tokenizing-functions, reference):
i = reference["i"]
j, token-id = find_lexeme(regular-expressions, source-program, i)
lexeme = source-program[i:j]
token = tokenizing-functions[token-id](lexeme)
reference["i"] = j
return token
Now we can move on to write our constructor for the lexer, which will once again follow similarly to the NFA’s version.
def build-lexer(regular_expressions, tokenizing_functions):
reference = {"i": 0}
has-next-token = lambda source: find-lexeme(regular-expressions, source, reference["i"])[0]
next-token = lambda source: next_token(source, regular-expressions, tokenizing-functions, reference)
return ["LEXER", has-next-token, next-token]
Testing
Lastly, we just want to test it to make sure everything works out correctly. We will use the same language as before with simple numbers.
def build-sigma(alphabet):
re = build-symbol(alphabet[0])
for symbol in alphabet[1:]:
re = build-or(build-symbol(symbol), re)
return re
regular-expressions = [
build-concat(build-sigma("012345679"), build-kleene-star(build-sigma("0123456789"))),
build-symbol("+"),
build-symbol("_"),
build-symbol(" ")
]
tokenizing-functions = [
lambda lexeme: ["INT", lexeme, int(lexeme)],
lambda lexeme: ["PLUS", lexeme, None],
lambda lexeme: ["MINUS", lexeme, None],
lambda lexeme: ["IGNORE", lexeme, None]
]
source-program = "10 + 5 - 3"
lexer = build-lexer(regular-expressions, tokenizing-functions)
while lexer[1](source-program) is not False:
print(lexer[2](source-program))
This code should give us an output that looks like this.
['INT', '10', 10]
['IGNORE', ' ', None]
['PLUS', '+', None]
['IGNORE', ' ', None]
['INT', '5', 5]
['IGNORE', ' ', None]
Books
Listed below are a few of my favorite books that taught me a fair amount of compiler construction. If you have books that benefited you on this topic, please share them in the comments below and expand this list!
- Introduction to Languages and the Theory of Computation, By John Martin
- Regular Algebra and Finite Machines, By John Conway
- Compilers: Principles, Techniques, and Tools, By Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman
- Introduction to the Theory of Computation, By Michael Sipser
