May 2026
githubMy most accomplished project to this day. ft_lex introduced me to the world of automatons and compilers theory. I’m actually looking forward for the following project ft_yacc.
A full reimplementation of the lex generator, targeting POSIX compliance. Takes a .l specification as input and produces a complete C scanner - a transition table, a yylex() driver, and a companion libl for linking - exactly mirroring the interface of the original tool.
The pipeline follows classical compiler construction: the lex spec is parsed into rules and macros, each regex is converted to postfix via shunting-yard, compiled into an NFA (Nondeterministic Finite Automaton) using Thompson’s construction, then all NFAs are merged and converted to a single DFA (Deterministic Finite Automaton) via subset construction. Character classes are represented as 128-bit ASCII bitsets throughout, avoiding expansion into large alternations. The generated tables support an optional compressed representation using a three-stage compression pipeline to reduce memory footprint at the cost of an indirect lookup.
The yylex() runtime is written as a C template embedded at build time via xxd -i, instantiated by marker substitution during code generation. It supports the full standard interface: start conditions (inclusive and exclusive), BOL (beginning of line) anchors, yymore(), REJECT, %array/%pointer modes, and trailing contexts, fixed-length or variable-length (isolated DFA simulated at runtime).
Complete pipeline.
Understanting automata theory from scratch. NFAs, DFAs, epsilon-closure, subset construction - none of this was prior knowledge. The theory had to be built before the code could be written, which meant working through the Dragon Book (Compilers: Principles, Techniques, and Tools, Aho, Sethi, Ullman) and other references (and making a lot of poorly drawn schemas) before any implementation started.
NFA vs DFA representation of ab|a.
Evolving the character class representation. Early versions had no dedicated character class concept - a pattern like [a - z] compiled to a chain of 26 alternations, which exploded NFA size. A first attempt at a compact representation using vector<bool> introduced subtle bugs in the NFA creation. The final design - a bitset<128> attached directly to tokens - is both compact and correct.