How Tokenization Works in an NFA-Constructed Lexer

This article follows the guide on NFA Construction. If this is your first article in the series, it may help to start with the Compiler Construction Guide.

We’ve come a long way in constructing our NFA, but the work is not over yet. We still need to apply the NFA to our source program. To do this, we will match the longest string we can without reaching a failing state. Once we have that longest match, we call it our lexeme. We then build a token from the lexeme and return that token from our next_token function.

Figure 1: The idea behind tokenization
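
To make the idea concrete, here is a hypothetical trace of this "maximal munch" strategy on the input "12+3" (the actual machinery is built in the sections below):

# Scanning "12+3" from index 0:
#   '1' -> accepting (INT), '2' -> accepting (INT), '+' -> failing state
#   back up to the last accepting state => lexeme "12"
#   emit ["INT", "12", 12] and resume scanning at the '+'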

Finding a Lexeme

Finding a lexeme takes three inputs: the subset construction, the source program, and an initial index. The initial index tells our function where to begin scanning the source program. The idea is that we take a character from the source program, run it through the subset construction, and keep doing so until we reach a failing state, which in our case is the empty set.

Once a failing state is reached, we look back through the records we kept for the last occurrence of a successful (accepting) state. We then return the index just past the end of that lexeme along with the accepting state, or False for both if no accepting state was reached. Seeing False is our cue that we are at EOF.

def find_lexeme(subset_construction, source_program, initial_string_index):
    current_string_index = initial_string_index
    current_state = subset_construction[1]  # the start state
    records = []
    while current_string_index < len(source_program):
        symbol = source_program[current_string_index]
        # step the machine: subset_construction[2] is the transition function
        current_state = subset_construction[2](current_state, symbol)
        # record the state, the index consumed, and the state's accept marker
        # (subset_construction[3] returns False for non-accepting states)
        records += [(current_state, current_string_index, subset_construction[3](current_state))]
        current_string_index += 1
        if records[-1][0] == set():  # the empty set is the failing state
            break

    # walk backwards to the last accepting state we passed through
    for record in reversed(records):
        if record[2] is not False:
            return record[1] + 1, record[2]  # index past the lexeme, accept state
    return False, False  # no lexeme found: we are at EOF
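
As a quick sanity check, here is how find_lexeme behaves on our math language (a sketch; subset_NFA is built further below, and the exact accept-state values depend on your NFA representation):

j, state = find_lexeme(subset_NFA, "12+3", 0)
# j == 2 and state is the integer NFA's accept marker,
# so the lexeme is "12+3"[0:2] == "12"
j, state = find_lexeme(subset_NFA, "12+3", len("12+3"))
# (False, False): nothing left to scan, i.e. EOF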

Creating a simple language for testing

In the next bit, we will develop a straightforward language to test our ideas: a simple math language with integers, addition, and subtraction. First, we need to create the regular expressions for these tokens. We will make our lives easier with a build_sigma function, which takes the characters of a string and ORs them together. Then we will use that function to define our regular expressions.

def build_sigma(alphabet):
    # OR together a symbol matcher for every character in the alphabet string
    re = build_symbol(alphabet[0])
    for symbol in alphabet[1:]:
        re = build_or(build_symbol(symbol), re)
    return re
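
To see what build_sigma produces, expanding it by hand on a small alphabet gives (build_or and build_symbol are the constructors from the NFA Construction article):

# build_sigma("abc") builds, step by step:
#   build_symbol("a")
#   build_or(build_symbol("b"), build_symbol("a"))
#   build_or(build_symbol("c"), build_or(build_symbol("b"), build_symbol("a")))
# i.e. the regular expression c|b|a: match any single character of the alphabet.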

regular_expressions = [
    # integer: a digit followed by zero or more digits
    build_concat(build_sigma("0123456789"), build_kleene_star(build_sigma("0123456789"))),
    build_symbol("+"),
    build_symbol("-"),
    build_symbol(" ")
]

Now that we have some regular expressions, we want to define some tokens. Before programming, we should discuss how the tokens will be structured. A token will be a 3-tuple holding the token name, the lexeme, and an optional value. The optional value is generated by a lambda function stored in an array that is one-to-one with our regular expressions array.

tokenizing_functions = [
    # one token builder per regular expression, in the same order
    lambda lexeme: ["INT", lexeme, int(lexeme)],
    lambda lexeme: ["PLUS", lexeme, None],
    lambda lexeme: ["MINUS", lexeme, None],
    lambda lexeme: ["IGNORE", lexeme, None]
]
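
For example, feeding a matched lexeme through the corresponding builder gives:

tokenizing_functions[0]("42")  # -> ["INT", "42", 42]
tokenizing_functions[1]("+")   # -> ["PLUS", "+", None]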

Building the Lexer

Our lexer will be another object represented as an array, this time holding our has_next_token and next_token functions. To begin, we need a next_token helper built on the find_lexeme function from before. It uses a hash map (or a global variable) as a reference point to track where scanning begins, and it uses our tokenizing functions list to build the tokens.

def next_token(source_program, success_states, tokenizing_functions, subset_NFA, reference):
    i = reference["i"]  # where the previous token ended
    j, state = find_lexeme(subset_NFA, source_program, i)
    lexeme = source_program[i:j]
    # which NFA's accept state matched determines the token type
    token_id = success_states.index(state)
    token = tokenizing_functions[token_id](lexeme)
    reference["i"] = j  # advance the reference point for the next call
    return token
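
Called repeatedly on the same input (a sketch, assuming the success_states and subset_NFA produced by build_lexer below), next_token walks through the source program one lexeme at a time:

reference = {"i": 0}
next_token("12+3", success_states, tokenizing_functions, subset_NFA, reference)
# -> ["INT", "12", 12]; reference["i"] is now 2
next_token("12+3", success_states, tokenizing_functions, subset_NFA, reference)
# -> ["PLUS", "+", None]; reference["i"] is now 3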

With the assistance of the next_token helper, we can write our build_lexer function. This function connects everything we have done so far: it builds the individual NFAs, glues them together, compiles them, and simplifies them with the subset construction. It then sets everything up for the has_next_token and next_token functions.


def build_lexer(alphabet, regular_expressions, tokenizing_functions):
    NFAs = [NFA_from_re(re) for re in regular_expressions]
    combined_NFA = combine_NFAs(NFAs)
    compiled_NFA = compile_NFA(combined_NFA)
    subset_NFA = build_subset_construction(compiled_NFA, alphabet)
    success_states = [nfa[3] for nfa in NFAs]  # each NFA's accept state, one per token
    reference = {"i": 0}  # mutable scan position shared by the closures below
    has_next_token = lambda source_program: find_lexeme(subset_NFA, source_program, reference["i"])[0]
    next_token_ = lambda source_program: next_token(source_program, success_states, tokenizing_functions, subset_NFA, reference)
    return ["LEXER", has_next_token, next_token_]

Congratulations, you have now created a lexical analyzer generator. It is not the best or most efficient lexical analyzer, but it’s a starting point. Next, we will discuss an optimization for the NFA creation process that may or may not save time.

Books

Listed below are a few of my favorite books that taught me a fair amount of compiler construction. If you have books that benefited you on this topic, please share them in the comments below and expand this list!