A Guide to Better Understand Lexical Analyzer Generators

Lexical Analysis Generators

Lexical Analyzer Generators


A lexical analyzer generator is a tool that allows many lexical analyzers to be created with a simple build file. Such a build file would provide a list of declarations that provide the generator the context it needs to develop a lexical analyzer.

Figure 1: Relationships between the lexical analyzer generator and the lexer

That opens the question of what should this build file look like. In Chapter 3 of Compiler Principles, Techniques, and Tools they describe a lexer that takes a file with 3 sections. The three sections are declarations, translation rules, and auxiliary functions.

Lex – Declarations

The declarations are manifest constants and regular definitions. Manifest definitions are essentially constant variables that represent operators. This includes the less than or greater than sign or equal sign or such as if, and else. Aside from the manifest constants the declarations also include identifiers and their regular expression match. This might be something such as letter [A-Za-z] to match to any letter, or id {letter}({letter}|{digit})* which matches any variable name. This one is a good example because it uses the previous declarations of letter and digit.

Lex – Translation rules

The translation rules have the form of Patter { Action }. These allow the language to decide what to do with a pattern once it is recognized. These patterns can use the regular definitions from the declarations to match and/or simple regular expressions.

Lex – Auxiliary functions

The last section, auxiliary functions, is a section that holds whatever additional functions will be used for actions within the translation rules.

General Lexical Generators

This build file however is specific to their lexer. There is no true standard to how these build files are built, so it is very important when building a generator that you document the build procedure in depth. Whether you build a file is written in the supporting language or in a language in and of itself, it’s important for the users who follow to know how the lexer works.

The discussion that follows the build file is the implementation strategy. For this, there are many strategies to look to. The typical lexer that you may see is an NFA or Enhanced NFA construction, there are others though such as derivative-based construction and DFA construction. Many of these we will look into and discuss in other threads.


Guides to each construction method


Books

Listed below are a few of my favorite books that taught me a fair amount of compiler construction. If you have books that benefited you on this topic, please share them in the comments below and expand this list!