A Guide to NFA-Construction: A Simple Implementation

NFA Construction - Implementation

The Implementation


This guide follows a guide for NFA Construction. If this is your first article here it might help to start at the Compiler Construction Guide.

Now the tricky part comes, how do we implement this? For this, we will take a functional approach and start with building some regular expressions.

Regular Expressions Representation

We will represent a regular expression with an array that holds its type and the number of regular expressions it requires.

def build-epsilon():
    return ["Regular Expression", "Epsilon"]


def build-symbol(symbol):
    return ["Regular Expression", "SYMBOL", symbol]


def build-concat(re-1, re-2):
    return ["Regular Expression", "CONCAT", re-1, re-2]


def build-or(re-1, re-2):
    return ["Regular Expression", "OR", re-1, re-2]


def build-kleene_star(re):
    return ["Regular Expression", "KLEENE STAR", re]

Next, we will make a type checking system with a series of functions. This will allow us to ensure we are building the correct NFAs down the road.

def is-epsilon(re):
    return type(re) == list and re[0] == "Regular Expression" and re[1] == "Epsilon"


def is-symbol(re):
    return type(re) == list and re[0] == "Regular Expression" and re[1] == "SYMBOL"


def is-concat(re):
    return type(re) == list and re[0] == "Regular Expression" and re[1] == "CONCAT"


def is-or(re):
    return type(re) == list and re[0] == "Regular Expression" and re[1] == "OR"


def is-kleene-star(re):
    return type(re) == list and re[0] == "Regular Expression" and re[1] == "KLEENE STAR"

NFA Representation

Next, we have to come up with some representation of an NFA, just as we did with the regular expressions. For this, we will create a state representation, transition representation, and then finally an NFA representation which holds the typical five-tuple information, but we can exempt the alphabet and assume only one accepting state for now. All of these will have the same array structure as we saw with regular expressions.

STATE_ID = 0

def build-state():
    return ["NFA_STATE", ++STATE_ID]


def build-epsilon-transition(state-1, state-2):
    return ["NFA_TRANSITION", state-1, "EPSILON", state-2]


def build-transition(state-1, symbol, state-2):
    return ["NFA_TRANSITION", state-1, symbol, state-2]


def build-NFA(states, initial-state, accepting-state, transitions):
    return ["NFA", states, initial-state, accepting-state, transitions]

Transitioning from Regular Expressions to NFAs: The Base Cases

Construction diagrams for epsilon, symbol, or, concat, and kleene star
Figure 6: Full NFA from Regular Expression diagram

Now we move on to transitioning regular expressions that are represented with this code into NFAs. To do so we will make some helper functions that are tasked with one regular expression at a time. First, we have the base cases which are simple.

def nfa-from-re-epsilon():
    state = build-state()
    return build-NFA([state], state, state, [])

def nfa-from-re-symbol(re):
    initial-state = build-state()
    accepting-state = build-state()
    transition = build-transition(initial-state, re[2], accepting-state)
    return build-NFA([initial-state, accepting-state], initial-state, accepting-state, [transition])

Now we can start constructing our recursive function to build the NFA from a regular expression. We want to start this process now so that logic will follow in the other helper functions.

def nfa-from-re(re):
    if is-epsilon(re):
        return nfa-from-re-epsilon()
    else if is-symbol(re):
        return nfa-from-re-symbol(re)
    else:
        throw Exception("Unknown regular expression has been given to nfa-from-re")

Now as of right now this will error out on any complex regular expression. So we will start implementing the necessary helper functions. We simply wanted this done first so that we would not be calling an imaginary function in the meantime.

Transitioning from Regular Expressions to NFAs: The Recursive Cases

With the recursive cases, we recognize that all of these functions will have a new initial state and accepting state, so they will all start similarly. Along with that, they will all have to call nfa-from-re to construct they’re containing regular expressions as well. From there we will just have to build the transitions as according to Figure 1 above.

def nfa-from-re-concat(re):
    machine-1 = nfa-from-re(re[2])
    machine-2 = nfa-from-re(re[3])
    initial-state = build-state()
    accepting-state = build-state()
    transition-1 = build-epsilon-transition(initial-state, machine-1[2])
    transition-2 = build-epsilon-transition(machine-1[3], machine-2[2])
    transition-3 = build-epsilon-transition(machine-2[3], accepting-state)
    states = [initial-state] + machine-1[1] + machine-2[1] + [accepting-state]
    transitions = [transition-1, transition-2, transition-3] + machine-1[4] + machine-2[4]
    return build-NFA(states, initial-state, accepting-state, transitions)


def nfa-from-re-or(re):
    machine-1 = nfa-from-re(re[2])
    machine-2 = nfa-from-re(re[3])
    initial-state = build-state()
    accepting-state = build-state()
    transition-1 = build-epsilon-transition(initial-state, machine-1[2])
    transition-2 = build-epsilon-transition(initial-state, machine-2[2])
    transition-3 = build-epsilon-transition(machine-1[3], accepting-state)
    transition-4 = build-epsilon-transition(machine-2[3], accepting-state)
    states = [initial-state] + machine-1[1] + machine-2[1] + [accepting-state]
    transitions = [transition-1, transition-2, transition-3, transition-4] + machine-1[4] + machine-2[4]
    return build-NFA(states, initial-state, accepting-state, transitions)


def nfa-from-re-kleene-star(re):
    machine = nfa-from-re(re[2])
    initial-state = build-state()
    accepting-state = build-state()
    transition-1 = build-epsilon-transition(initial-state, machine[2])
    transition-2 = build-epsilon-transition(machine[3], machine[2])
    transition-3 = build-epsilon-transition(machine[3], accepting-state)
    transition-4 = build-epsilon-transition(initial-state, accepting-state)
    states = [initial-state] + machine[1] + [accepting-state]
    transitions = [transition-1, transition-2, transition-3, transition-4] + machine[4]
    return build-NFA(states, initial-state, accepting-state, transitions)

Now we can revisit our NFA-from-re function and finish it.

def nfa-from-re(re):
    if is-epsilon(re):
        return nfa-from-re-epsilon()
    else if is-symbol(re):
        return nfa-from-re-symbol(re)
    else if is-concat(re):
        return nfa-from-re-concat(re)
    else if is-or(re):
        return nfa-from-re-or(re)
    else if is-kleene-star(re):
        return nfa-from-re-kleene-star(re)
    else:
        throw Exception("Unknown regular expression has been given to nfa-from-re")

Testing and Debuging

Now would be a good time to stop and debug to make sure things work as predicted. Try building one of every recursive case as well as the base cases and print out their information in an easy-to-read manner. Then draw a diagram and make sure that it follows the pattern that you see in figure 6. Once you know that your NFA creation is good you should feel comfortable to move on. Here are a couple of test cases to try.

def print-nfa(nfa):
    print("Initial State:", nfa[2])
    print(f"Accepting State:", nfa[3])
    print("States:", ",".join(nfa[1]))
    print("Transitions:", "\n\t\t\t".join([x[1:] for x in nfa[4]]))

print("Symbol")
print-nfa(nfa-from-re(build-symbol("a")))

print("Concat")
print-nfa(nfa-from-re(build-concat(build-symbol("a"), build-symbol("b"))))

print("Or")
print-nfa(nfa-from-re(build-or(build-symbol("a"), build-symbol("b"))))

print("Kleene Star")
print-nfa(nfa-from-re(build-kleene-star(build-symbol("a"), build-symbol("b"))))

Now in terms of implementation, we can create many NFA’s to represent the language. But we would much prefer both in theory and in practice to have one NFA that can recognize our language. To do this we will need to “glue” NFAs together. To better understand what we’re trying to accomplish we will make a list of regular expressions that will then pass into a helper function to construct a list of NFAs. This list is what we will then glue together into one single NFA. We will also make a helper function to allow us to string many logical or statements together.

def regular-expressions-to-NFAs(regular-expressions):
    NFAs = []
    for re in regular-expressions:
        NFAs += [nfa-from-re(re)]
    return NFAs

def make-ors(regular-expressions):
    regular-expression = regular-expressions[0]
    for re in regular-expressions[1:]:
        regular-expression = build-or(re, regular-expression)
    return regular-expression


my-simple-lang = [
    make-ors("ab"),
    make-ors("12"),
]

Combining Multiple NFAs

Before building our NFA gluing function we should address something important. Up till this point, we had one regular expression which has one accepting state. No matter what we only had one accepting state to worry about. When we combine these NFAs we will now have multiple accepting states. So rather than give our build-NFA function a single state we will give it a set of states.


def combine-NFAs(NFAs):
    initial-state = build-state()
    states = [initial-state]
    transitions = []
    accepting-states = {}
    for NFA in NFAs:
        states += NFA[1]
        transitions += [build-epsilon-transition(initial-state, NFA[2])] + NFA[4]
        accepting-states = accepting-states ∪ {NFA[3]}
    return build-nfa(states, initial-state, accepting-states, transitions)

After all of that when you use the pretty print function we developed earlier then you should get something that looks like this if we are to assume that every transition without a symbol is an epsilon transition.

Figure 7: Simple Lang NFA Construction.

Differentiating NFAs From Combined NFAs

The last thing we will want to do is allow ourselves an easier time transitioning through the NFA. For this, we will create a “compiled NFA” which will return a list with a lambda function with the NFA. This lambda function will serve as our transition function for the specific NFA. Since as of right now we only have transitions stored in an array, we will want to also write a function that takes an NFA, state, and symbol and return a set of states reachable with that given symbol. We won’t concern ourselves with epsilon transitions, and the complexity that they bring now, because later we will solve this issue. We will assume that if you are on state 4 you will only put in epsilon as an input symbol.

def transition(NFA, state, symbol):
    states = set()
    for transition in NFA[4]:
        if transition[1] == state and transition[2] == symbol:
            states = states ∪ {transition[3]}
    return states

def compile-NFA(NFA):
    return ["COMPILED-NFA", NFA, (state, symbol) => transition(NFA, state, symbol)]

Congratulations you have now constructed an NFA from a regular expression. In the next section, we will look to better this and get closer to a final lexical analyzer generator.

Books

Listed below are a few of my favorite books that taught me a fair amount of compiler construction. If you have books that benefited you on this topic, please share them in the comments below and expand this list!