Tokenization

This article follows a guide for Compiler Construction. If this is your first article here, it might help to start at the Compiler Construction Guide.

To guide, a tokenizer for our derivative-based construction will look much like our NFA’s construction’s tokenizer. When we approached this, we reminded ourselves of what we were trying to accomplish with Figure 1. It may help just to refresh yourself with this figure again.

Demonstration of the source program being given to the find-lexeme function then tokenizer and then parser — Figure 1: The idea behind tokenization

Finding a Lexeme

With our NFA construction, we needed to consider states and the location of the current parse. With the current implementation we are building, we do not worry about the state. But that does mean we need to find another manner to look through the source program.

The idea is pretty close still. We will loop through the file one string at a time and look for the largest matching regular expression. We will use the same records list as before; this time, we will have boolean values for each of the regular expressions instead of a set of states. All of these are false; then we know we have passed the largest matching string and can look back to get it.

def find-lexeme(regular-expressions, source-program, initial-string-index):
    current-string-index = initial-string-index + 1
    records = []
    while source_program.length > current_string_index:
        string = source_program[initial-string-index: current-string-index]
        re-matches = [match(re, string) for re in regular-expressions]
        records += [[initial-string-index, current-string-index, re-matches]]
        if True not in re-matches:
            if records.length == 1:
                break
            record = records[-2]
            index = record[1]
            re_index = record[2].index(True)
            return index, re-index
        current-string-index += 1
    return False, False

Building the Lexer

We will take the same approach as before, with an object represented as an array. This array having 2 functions within it, has-next-token and next-token. To do this, though, that means we once again need our helper function of next-token, which we will then wrap in a lambda function around it.

def next-token(source-program, regular-expressions, tokenizing-functions, reference):
    i = reference["i"]
    j, token-id = find_lexeme(regular-expressions, source-program, i)
    lexeme = source-program[i:j]
    token = tokenizing-functions[token-id](lexeme)
    reference["i"] = j
    return token

Now we can move on to write our constructor for the lexer, which will once again follow similarly to the NFA’s version.

def build-lexer(regular_expressions, tokenizing_functions):
    reference = {"i": 0}
    has-next-token = lambda source: find-lexeme(regular-expressions, source, reference["i"])[0]
    next-token = lambda source: next_token(source, regular-expressions, tokenizing-functions, reference)
    return ["LEXER", has-next-token, next-token]

Testing

Lastly, we just want to test it to make sure everything works out correctly. We will use the same language as before with simple numbers.

def build-sigma(alphabet):
    re = build-symbol(alphabet[0])
    for symbol in alphabet[1:]:
        re = build-or(build-symbol(symbol), re)
    return re


regular-expressions = [
    build-concat(build-sigma("012345679"), build-kleene-star(build-sigma("0123456789"))),
    build-symbol("+"),
    build-symbol("_"),
    build-symbol(" ")
]

tokenizing-functions = [
    lambda lexeme: ["INT", lexeme, int(lexeme)],
    lambda lexeme: ["PLUS", lexeme, None],
    lambda lexeme: ["MINUS", lexeme, None],
    lambda lexeme: ["IGNORE", lexeme, None]
]


source-program = "10 + 5 - 3"
lexer = build-lexer(regular-expressions, tokenizing-functions)
while lexer[1](source-program) is not False:
    print(lexer[2](source-program))

This code should give us an output that looks like this.

['INT', '10', 10]
['IGNORE', ' ', None]
['PLUS', '+', None]
['IGNORE', ' ', None]
['INT', '5', 5]
['IGNORE', ' ', None]

Books

Listed below are a few of my favorite books that taught me a fair amount of compiler construction. If you have books that benefited you on this topic, please share them in the comments below and expand this list!

Introduction to Languages and the Theory of Computation, By John Martin
Regular Algebra and Finite Machines, By John Conway
Compilers: Principles, Techniques, and Tools, By Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman
Introduction to the Theory of Computation, By Michael Sipser

How to Perform Tokenization on a Derivative-Based Lexer

Tokenization

Finding a Lexeme

Building the Lexer

Testing

Books

Leave a Reply Cancel reply