Flexer

The flexer is a finite-automata-based engine for the definition and generation of lexers. Akin to flex, and other lexer generators, the user may use it to define a series of rules for lexing their language, which are then used by the flexer to generate a highly-efficient lexer implementation.

Where the flexer differs from other programs in this space, however, is the power that it gives users. When matching a rule, the flexer allows its users to execute arbitrary Rust code, which may even manipulate the lexer’s state and position. This means that the languages that can be lexed by the flexer extend from the simplest regular grammars right up to unrestricted grammars (but please don’t write a programming language whose syntax falls into this category). It also differs in that it chooses the first complete match for a rule, rather than the longest one, which makes lexers much easier to define and maintain.

For detailed library documentation, please see the crate documentation itself. This includes a comprehensive tutorial on how to define a lexer using the flexer.

The Lexing Process

In the flexer, the lexing process proceeds from the top to the bottom of the user-defined rules, and selects the first expression that matches fully. Once a pattern has been matched against the input, the associated code is executed and the process starts again until the input stream has been consumed.

This point about matching fully is particularly important to keep in mind, as it differs from other lexer generators that tend to prefer the longest match instead.
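To make the difference concrete, the following is a minimal sketch that contrasts the two selection policies on a toy rule list. The names here (Matcher, keyword_if, identifier, first_match, longest_match) are hypothetical and are not part of the flexer's API; the sketch only illustrates the idea of picking the first rule that matches rather than the longest match.

```rust
/// A toy matcher: returns how many bytes of the input prefix it consumes,
/// or `None` if it does not match. (Hypothetical; not the flexer's API.)
type Matcher = fn(&str) -> Option<usize>;

/// Matches the literal keyword `if`.
fn keyword_if(input: &str) -> Option<usize> {
    if input.starts_with("if") { Some(2) } else { None }
}

/// Matches one or more ASCII letters.
fn identifier(input: &str) -> Option<usize> {
    let len = input.chars().take_while(|c| c.is_ascii_alphabetic()).count();
    if len > 0 { Some(len) } else { None }
}

/// First-match policy: rules are tried top to bottom and the first hit wins.
fn first_match(
    rules: &[(&'static str, Matcher)],
    input: &str,
) -> Option<(&'static str, usize)> {
    rules.iter().find_map(|(name, m)| m(input).map(|len| (*name, len)))
}

/// Longest-match policy: every rule is tried and the longest hit wins.
/// Ties go to the rule that was defined earlier.
fn longest_match(
    rules: &[(&'static str, Matcher)],
    input: &str,
) -> Option<(&'static str, usize)> {
    let mut best: Option<(&'static str, usize)> = None;
    for (name, m) in rules {
        if let Some(len) = m(input) {
            if best.map_or(true, |(_, best_len)| len > best_len) {
                best = Some((*name, len));
            }
        }
    }
    best
}
```

With the rules ordered as keyword_if then identifier, the input "iffy" is claimed by keyword_if under the first-match policy, whereas the longest-match policy hands it to identifier. Under first-match, rule order is therefore part of the lexer's meaning.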

Lexing Rules

A lexing rule for the flexer is a combination of three things:

  1. A group.
  2. A pattern.
  3. A transition function.

An example of defining a rule is as follows:

fn define() -> Self {
    let mut lexer     = TestLexer::new();
    let a_word        = Pattern::char('a').many1();
    let root_group_id = lexer.initial_state;
    let root_group    = lexer.groups_mut().group_mut(root_group_id);
    // Here is the rule definition.
    root_group.create_rule(&a_word,"self.on_first_word(reader)");
    lexer
}

Groups

A group is a mechanism that the flexer provides for grouping rules together. The flexer has a concept of a “state stack”, which records the currently active state and can be manipulated by the user-defined transition functions.

A state can be made active by using flexer::push_state(state), and can be deactivated by using flexer::pop_state(state) or flexer::pop_states_until(state). States may also have parents, from which they can inherit rules. This removes the need to repeat rules when defining the lexer.
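The stack discipline can be pictured with a small stand-alone model. This is only a sketch: StateStack and its methods are hypothetical, states are modelled as plain usize identifiers, and the behaviour of pop_until (stop once the requested state is on top, leaving it active) is an assumption about what pop_states_until does rather than something confirmed here.

```rust
/// A toy state stack. (Hypothetical; not the flexer's own type.)
#[derive(Debug, Default)]
struct StateStack {
    stack: Vec<usize>,
}

impl StateStack {
    /// Makes `state` the active state.
    fn push(&mut self, state: usize) {
        self.stack.push(state);
    }

    /// Deactivates the active state and returns it, if there is one.
    fn pop(&mut self) -> Option<usize> {
        self.stack.pop()
    }

    /// Pops states until `state` is on top, and returns the popped states.
    /// ASSUMPTION: the target state itself stays on the stack. If `state`
    /// is not on the stack at all, the stack ends up empty.
    fn pop_until(&mut self, state: usize) -> Vec<usize> {
        let mut popped = Vec::new();
        while let Some(&top) = self.stack.last() {
            if top == state {
                break;
            }
            popped.push(top);
            self.stack.pop();
        }
        popped
    }

    /// The currently active state, if any.
    fn current(&self) -> Option<usize> {
        self.stack.last().copied()
    }
}
```

A transition function that enters a nested construct would push a state, and one that leaves it would pop back to the enclosing state.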

When inheriting rules from a parent group, the rules from the parent group are matched strictly after the rules from the child group. This means that groups are able to selectively “override” the rules of their parents. Rules are still matched in order for each group’s set of rules.
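The resulting lookup order can be sketched as follows. Group and effective_rules are hypothetical names used for illustration only, rules are reduced to their names, and the sketch assumes that the parent links form no cycles.

```rust
/// A toy rule group. (Hypothetical stand-in for a flexer group.)
struct Group {
    name: &'static str,
    /// Index of the parent group, if any.
    parent: Option<usize>,
    /// Rule names, in definition order.
    rules: Vec<&'static str>,
}

/// Returns `(group name, rule name)` pairs in the order in which they would
/// be tried from `group`: the group's own rules first, then its parent's
/// rules, and so on up the chain. Assumes the parent links contain no cycle.
fn effective_rules(groups: &[Group], group: usize) -> Vec<(&'static str, &'static str)> {
    let mut ordered = Vec::new();
    let mut current = Some(group);
    while let Some(ix) = current {
        let g = &groups[ix];
        ordered.extend(g.rules.iter().map(|rule| (g.name, *rule)));
        current = g.parent;
    }
    ordered
}
```

If both a child and its parent define a rule for the same pattern, the child's rule appears earlier in this order and so shadows the parent's.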

Patterns

Rules are defined to match patterns. Patterns are regular-grammar-like descriptions of the textual content (as characters) that should be matched. For a description of the various patterns provided by the flexer, see pattern.rs.

When a pattern is matched, the associated transition function is executed.

Transition Functions

The transition function is a piece of arbitrary Rust code that is executed when the pattern for a given rule is matched by the flexer. This code may perform arbitrary manipulations of the lexer state, and is the source of most of the flexer’s power.

Code Generation

While it would be possible to interpret the flexer definition directly at runtime, this would involve far too much dynamism and non-cache-local lookup to be at all fast.

Instead, the flexer includes generate.rs, a library for generating highly-specialized lexer implementations based on the definition provided by the user. The transformation that it implements operates as follows for each group of rules.

  1. The set of rules in a group is used to generate a Nondeterministic Finite Automaton (NFA).
  2. The NFA is transformed into a Deterministic Finite Automaton (DFA), using a variant of the standard powerset construction algorithm. This variant has been modified to ensure that the following additional properties hold:
    • Patterns are matched in the order in which they are defined.
    • The associated transition functions are maintained correctly through the transformation.
    • The lexing process is O(n), where n is the size of the input.
  3. The DFA is then used to generate the Rust code that implements that lexer.
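For readers unfamiliar with step 2, here is a minimal sketch of the textbook powerset construction with one addition: when a DFA state contains several accepting NFA states, the rule that was defined earliest wins. This is not the code in generate.rs. The types Nfa and Dfa and the functions epsilon_closure, determinize and accepts are hypothetical, and the tie-break is only one simple way to respect definition order.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// A toy NFA over `char`. (Hypothetical; not the flexer's representation.)
/// A label of `None` is an epsilon transition.
struct Nfa {
    start: usize,
    transitions: Vec<(usize, Option<char>, usize)>,
    /// NFA state -> index of the rule it accepts (lower = defined earlier).
    accepting: BTreeMap<usize, usize>,
}

/// The resulting DFA. Each DFA state is a set of NFA states; state 0 is the start.
struct Dfa {
    states: Vec<BTreeSet<usize>>,
    transitions: BTreeMap<(usize, char), usize>,
    /// DFA state -> index of the rule it accepts.
    accepting: BTreeMap<usize, usize>,
}

/// All NFA states reachable from `seed` using epsilon transitions only.
fn epsilon_closure(nfa: &Nfa, seed: BTreeSet<usize>) -> BTreeSet<usize> {
    let mut closure = seed.clone();
    let mut todo: Vec<usize> = seed.into_iter().collect();
    while let Some(state) = todo.pop() {
        for &(from, label, to) in &nfa.transitions {
            if from == state && label.is_none() && closure.insert(to) {
                todo.push(to);
            }
        }
    }
    closure
}

/// Powerset construction with an "earliest rule wins" tie-break.
fn determinize(nfa: &Nfa) -> Dfa {
    let start = epsilon_closure(nfa, BTreeSet::from([nfa.start]));
    let mut dfa = Dfa {
        states: vec![start.clone()],
        transitions: BTreeMap::new(),
        accepting: BTreeMap::new(),
    };
    let mut index: BTreeMap<BTreeSet<usize>, usize> = BTreeMap::new();
    index.insert(start, 0);
    let mut queue = VecDeque::new();
    queue.push_back(0usize);
    while let Some(id) = queue.pop_front() {
        let set = dfa.states[id].clone();
        // If several NFA states accept, keep the earliest-defined rule.
        if let Some(rule) = set.iter().filter_map(|s| nfa.accepting.get(s)).min() {
            dfa.accepting.insert(id, *rule);
        }
        // Group the NFA states reachable from this set by input character.
        let mut moves: BTreeMap<char, BTreeSet<usize>> = BTreeMap::new();
        for &(from, label, to) in &nfa.transitions {
            if let Some(c) = label {
                if set.contains(&from) {
                    moves.entry(c).or_default().insert(to);
                }
            }
        }
        for (c, targets) in moves {
            let closed = epsilon_closure(nfa, targets);
            let next = match index.get(&closed).copied() {
                Some(known) => known,
                None => {
                    let fresh = dfa.states.len();
                    dfa.states.push(closed.clone());
                    index.insert(closed, fresh);
                    queue.push_back(fresh);
                    fresh
                }
            };
            dfa.transitions.insert((id, c), next);
        }
    }
    dfa
}

/// Runs the DFA over the whole input and reports the rule accepted at the end.
fn accepts(dfa: &Dfa, input: &str) -> Option<usize> {
    let mut state = 0;
    for c in input.chars() {
        state = *dfa.transitions.get(&(state, c))?;
    }
    dfa.accepting.get(&state).copied()
}
```

Because each input character causes exactly one table lookup in the resulting DFA, running it is linear in the length of the input.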

The generated lexer contains a main loop that consumes the input stream character-by-character, evaluating what is effectively a big match expression that processes the input to evaluate the user-provided transition functions as appropriate.
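The shape of such a loop can be imitated by hand. The sketch below is not output of the flexer: it is a hand-written approximation for two made-up rules (one or more 'a' characters, and one or more spaces), with the hypothetical names Token and lex. Pushing a token stands in for calling the user's transition function.

```rust
/// Tokens produced by the toy lexer. (Hypothetical.)
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Blank,
}

/// Hand-written imitation of a generated lexer for the rules `a+` and ` +`.
/// On failure, returns the offset (in characters) of the unmatched input.
fn lex(input: &str) -> Result<Vec<Token>, usize> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < chars.len() {
        let start = pos;
        let mut state = 0u8;
        // Step the DFA one character at a time until no transition applies.
        loop {
            let next = match (state, chars.get(pos).copied()) {
                (0, Some('a')) => 1,
                (0, Some(' ')) => 2,
                (1, Some('a')) => 1,
                (2, Some(' ')) => 2,
                _ => break,
            };
            state = next;
            pos += 1;
        }
        // Run the "transition function" for the rule whose state was reached.
        match state {
            1 => tokens.push(Token::Word(chars[start..pos].iter().collect())),
            2 => tokens.push(Token::Blank),
            _ => return Err(start),
        }
    }
    Ok(tokens)
}
```

For example, lexing "aa  a" yields a word, a blank and another word, while "ab" fails at offset 1 because no rule matches 'b'.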

Automated Code Generation

In order to avoid the lexer definition getting out of sync with its implementation (the generated engine), it is necessary to create a separate crate for the generated engine that has the lexer definition as one of its dependencies.

This separation enables a call to flexer::State::specialize() in the crate’s build.rs (or a macro) during compilation. The output can be stored in a new file, e.g. engine.rs, and exported from the library as needed. The project structure would therefore appear as follows.

- lib/rust/lexer/
  - definition/
    - src/
      - lib.rs
    - Cargo.toml

  - generation/
    - src/
      - engine.rs <-- the generated file
      - lib.rs    <-- `pub mod engine`
    - build.rs    <-- calls `flexer::State::specialize()` and saves its output to
                      `src/engine.rs`
    - Cargo.toml <-- lexer-definition is in dependencies and build-dependencies

With this design, flexer::State::specialize() is executed on each rebuild of lexer/generation. Therefore, generation should contain only the minimum amount of logic, and should endeavor to minimize any unnecessary dependencies to avoid recompiling too often.
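The file-writing half of such a build step might look like the sketch below. It deliberately does not call the flexer: specialized_source is a hypothetical placeholder for whatever flexer::State::specialize() produces (assumed here to be a string of Rust source), and write_engine and run_build_step are illustrative helper names. In a real build.rs, main would call something like run_build_step with the crate root.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical placeholder for the output of `flexer::State::specialize()`.
/// ASSUMPTION: the generated engine is available as a string of Rust source.
fn specialized_source() -> String {
    String::from("// the generated lexer engine would go here\n")
}

/// Writes the generated code to `engine_path`, creating parent directories
/// if they do not exist yet.
fn write_engine(engine_path: &Path, code: &str) -> io::Result<()> {
    if let Some(dir) = engine_path.parent() {
        fs::create_dir_all(dir)?;
    }
    fs::write(engine_path, code)
}

/// The work that `build.rs` would do: generate the engine and save it to
/// `src/engine.rs` below the given crate root.
fn run_build_step(crate_root: &Path) -> io::Result<()> {
    let engine_path = crate_root.join("src").join("engine.rs");
    write_engine(&engine_path, &specialized_source())
}
```

Keeping this step to a single generate-and-write call is in line with the advice to keep the generation crate as small as possible.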

Structuring the Flexer Code

In order to unify the API between the definition and generated usages of the flexer, the API is separated into the following components:

  • Flexer: The main flexer definition itself, providing functionality common to the definition and implementation of all lexers.
  • flexer::State: The stateful components of a lexer definition. This trait is implemented for a particular lexer definition, allowing the user to store arbitrary data in their lexer, as needed.
  • User-Defined Lexer: The user can then define a lexer that wraps the flexer, specialised to the particular flexer::State that the user has defined. It is recommended to implement Deref and DerefMut between the defined lexer and the Flexer, to allow for ease of use.
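The recommended Deref and DerefMut wiring can be sketched with stand-in types. Nothing below is the real flexer API: Flexer, MyState, MyLexer, the fields offset and words_seen, and the method advance are all simplified, hypothetical placeholders chosen to show how the wrapper forwards to the wrapped value.

```rust
use std::ops::{Deref, DerefMut};

/// Stand-in for the user's state type. (Hypothetical field.)
#[derive(Default)]
struct MyState {
    words_seen: usize,
}

/// Stand-in for the `Flexer` type, generic over the user's state.
#[derive(Default)]
struct Flexer<S> {
    state: S,
    offset: usize,
}

impl<S> Flexer<S> {
    /// Functionality common to all lexers, here just moving the cursor.
    fn advance(&mut self, by: usize) {
        self.offset += by;
    }
}

/// The user-defined lexer wraps the flexer, specialized to `MyState`.
#[derive(Default)]
struct MyLexer {
    lexer: Flexer<MyState>,
}

impl Deref for MyLexer {
    type Target = Flexer<MyState>;
    fn deref(&self) -> &Self::Target {
        &self.lexer
    }
}

impl DerefMut for MyLexer {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.lexer
    }
}

impl MyLexer {
    /// A transition function written directly against the wrapper.
    fn on_first_word(&mut self, len: usize) {
        self.advance(len);          // method found on Flexer via DerefMut
        self.state.words_seen += 1; // field reached on Flexer via DerefMut
    }
}
```

With this in place, transition functions can use the common functionality and the user's state without spelling out the wrapped field each time.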

Supporting Code Generation

This architecture separates the generated code (which can be defined purely on the user-defined lexer) from the code that is defined as part of the lexer definition. This means that the same underlying structures can be used both to define the lexer and by the code generated from that definition.

For an example of how these components are used in the generated lexer, please see generated_api_test.