My struggles with parser combinators
Recursive descent parsing is a popular approach to parsing. Roughly speaking, it uses functions which take input (a string or a byte stream, for example), consume a prefix of it, and return some output (a data structure or a language AST).
These functions will usually call each other: a JSON array parser needs to parse the arbitrary JSON values inside it. This gives rise to parser combinators: what if, instead of having to write all these parsing functions by hand, we had a convenient way of combining them? Then you could just call separated(parse_json_value, ","), and that'd be it for an array parser. A parser combinator library is a package which provides such combinators and, usually, commonly used parser functions, like integer parsers.
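Here is a minimal sketch of what such a combinator could look like. Everything in it is hypothetical and simplified: parsers are reduced to plain functions returning Option, and the separator is a fixed string rather than another parser.

fn separated<'s, T>(
    item: impl Fn(&str) -> Option<(T, &str)>,
    separator: &'s str,
) -> impl Fn(&str) -> Option<(Vec<T>, &str)> + 's {
    move |mut input: &str| {
        let mut items = Vec::new();
        if let Some((value, rest)) = item(input) {
            items.push(value);
            input = rest;
            // Keep consuming "separator then item" pairs for as long as we can.
            while let Some(after_sep) = input.strip_prefix(separator) {
                let (value, rest) = item(after_sep)?;
                items.push(value);
                input = rest;
            }
        }
        Some((items, input))
    }
}

// A toy item parser: a single ASCII digit.
fn digit(input: &str) -> Option<(char, &str)> {
    let c = input.chars().next().filter(char::is_ascii_digit)?;
    Some((c, &input[c.len_utf8()..]))
}

fn main() {
    let list = separated(digit, ",");
    assert_eq!(list("1,2,3]"), Some((vec!['1', '2', '3'], "]")));
}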
Over the last few months I was tinkering with a Rust parser combinator library in my time off. There were already quite a few of those, so I could rely on their design and implementations for inspiration.
Excluding a bit of a lifetime mess I got myself into during earlier iterations, the library turned out to be pretty basic. It has one foundational trait, Parser<I, O, E>, which contains one function: parse(I) -> Result<(O, I), E>. Thus, a parser is an object which takes input and returns its output and the rest of the input, excluding the prefix the parser consumed. This trait is blanket-implemented for Fn types with the same signature, which made creating parser combinators relatively easy:
pub const fn delimited<I: Copy, OL, O, OR, E>(
    left: impl Parser<I, OL, E>,
    content: impl Parser<I, O, E>,
    right: impl Parser<I, OR, E>,
) -> impl Parser<I, O, E> {
    move |input| {
        let (_, rest) = left.parse(input)?;
        let (out, rest) = content.parse(rest)?;
        let (_, rest) = right.parse(rest)?;
        Ok((out, rest))
    }
}
This function takes three parser objects and returns a closure, which applies them in order. If even a single parser fails, the error is bubbled up. Otherwise the output of the middle parser is returned.
Implementing concrete parsers was also easy:
#[diagnostic::do_not_recommend]
impl<'i> Parser<&'i str, &'i str, Error<'i>> for &str {
    fn parse(
        &self,
        input: &'i str,
    ) -> PResult<&'i str, &'i str, Error<'i>> {
        if input.starts_with(self) {
            let length = self.len();
            Ok((&input[..length], &input[length..]))
        } else if input.len() < self.len() {
            Err(Error::end(input))
        } else {
            Err(Error::unmatched(&input[..self.len()]))
        }
    }
}
This allows using a string literal as a parser. On success it returns[1] the sub-slice of the input string matching the literal, which I envisioned as a convenient way to preserve span information[2].
Additionally, since function pointers also implement the Fn trait, plain functions could be used as parsers:
pub fn line_end(input: &str) -> PResult<&str, &str, Error> {
    choice(("\n", "\r\n")).parse(input)
}
Combining all of the above we can create a simple parser:
let my_parser = delimited("Hello, ", "world", line_end);
assert_eq!(Ok(("world", "")), my_parser.parse("Hello, world\n"));
And from those simple building blocks arbitrarily complex but efficient parsers can be made. At least, that was the idea. The reality turned out to be more complicated.
Construction overhead
Let me introduce a bit of terminology and define function parsers and factory parsers. Function parsers are concrete objects on which the parse function can be called. They include the above line_end parser or string literals, since using them is equivalent to calling Parser::parse(literal, input). Function parsers are static: all of their logic is set in stone at compile time.

Factory parsers create function parsers. delimited from above is a factory. When called, it creates a closure, which can act as a function parser. Importantly, factories are dynamic: they create parsers at runtime. This means that they are more flexible, allowing configuring parsers on the fly[3]. But it also means they are less efficient, as they have to do additional work at runtime.
Here is an example of that inefficiency. Below is a parser which is supposed to parse a JSON string[4].
fn string(input: &str) -> PResult<&str, String, Error> {
    let character = choice((
        "\\\"".value('\"'),
        "\\\\".value('\\'),
        "\\/".value('/'),
        "\\b".value('\x08'),
        "\\f".value('\x0C'),
        "\\n".value('\n'),
        "\\r".value('\r'),
        "\\t".value('\t'),
        none_of_char(&['\\', '"']).map_out(|s| s.chars().next().unwrap()),
    ));
    let p = fold(character, String::new, |acc, ch| acc.push(ch));
    delimited("\"", p, "\"").coerce().parse(input)
}
string is a parser function. The problem is that it uses factories underneath. Every single time it's called, it creates 9 escape parsers (the .value method returns a closure which replaces the output of the parser with the given value) wrapped in a choice combinator, which in turn is wrapped in a delimited combinator. And all this work is done every single time we need to parse a string.
Ideally, all these combinators would be initialized once. Luckily, since Rust doesn't spill closures to the heap, most of the logic can be executed at compile time. That's why the delimited parser above was annotated const: I was trying to find a way to initialize the parsers once[5]. Unluckily, Rust closures are unnameable, so it's impossible to create a const PARSER (its type can't be named).
Now, there is a solution to this problem: make all of your parsers factories. This way the whole parser will only be initialized once, avoiding all of this overhead. Which brings us to woe two.
Recursion
A lot of grammars are recursive. This includes most programming languages and, notably, JSON, for which I was writing an example parser. So, as I was converting my parser from functions to factories, I hit a SIGSEGV[6]. Which, given that I'm writing safe[7] Rust, can only mean one thing: infinite recursion.
There are various ways of dealing with it:

- pom has call, which creates a new lazy parser: it only constructs the underlying parser when it's called. The problem is that it puts us back to square one: now each time we run a call-created parser, we have to pay the underlying factory costs.
- chumsky has recursive, which uses a OnceCell underneath to lazily create the parser; a sketch of this idea follows below.

I believe nom and winnow don't have a solution and simply eat the cost of parser construction by using function parsers.
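Here is a minimal sketch of the OnceCell trick, reduced to boxed functions over Option rather than a full parser trait. The recursive function, the cell handle, and the parentheses grammar are all illustrative; this is not chumsky's actual implementation.

use std::cell::OnceCell;
use std::rc::Rc;

// A "parser" is reduced to a boxed function for brevity.
type ParseFn = dyn Fn(&str) -> Option<&str>;

// Hands the definition closure a handle to the parser being defined,
// fills the cell exactly once, and returns a parser that defers to it.
fn recursive(
    define: impl FnOnce(Rc<OnceCell<Box<ParseFn>>>) -> Box<ParseFn>,
) -> impl Fn(&str) -> Option<&str> {
    let cell: Rc<OnceCell<Box<ParseFn>>> = Rc::new(OnceCell::new());
    let parser = define(Rc::clone(&cell));
    assert!(cell.set(parser).is_ok(), "recursive parser defined twice");
    // Note: this toy version creates an Rc cycle and leaks; a real
    // implementation would hand out a weak handle instead.
    move |input: &str| cell.get().expect("parser not initialized")(input)
}

fn main() {
    // parens = ( "(" parens ")" )? -- a self-referential grammar.
    let parens = recursive(|this| {
        Box::new(move |input: &str| match input.strip_prefix('(') {
            Some(rest) => this.get()?(rest)?.strip_prefix(')'),
            None => Some(input),
        })
    });
    assert_eq!(parens("((()))"), Some(""));
    assert_eq!(parens("(()"), None);
}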
Parsing modes
The nom maintainer, Mr Couprie, has been working on Rust parser combinators for more than 10 years now[8]. I think that's why they were the first[9] to come up with the idea of output modes for parser combinators.
Here's the pitch. Most parser combinators return some output and an error. But when we are matching scaffold tokens, like a comma (",") in JSON, we don't care about the errors or the output. We only care about the binary outcome: did the parser succeed or not? In this case the return values of the parsers are pure overhead. Vectors have to be allocated for collectors, bulky return structs have to be spilled on the stack.
Output modes solve this problem by adding a generic type to parser combinators. It defines at compile time whether the parser will return concrete output and errors or just stub them with a ().
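Here is a stripped-down sketch of the mechanism. The Mode trait with Emit and Check implementors mirrors nom 8's design in spirit, but everything here, the digits parser included, is simplified and assumed rather than copied from nom.

// A mode decides, at compile time, whether values are actually produced.
trait Mode {
    type Output<T>;
    fn bind<T, F: FnOnce() -> T>(f: F) -> Self::Output<T>;
}

// Emit mode: really build the output.
struct Emit;
impl Mode for Emit {
    type Output<T> = T;
    fn bind<T, F: FnOnce() -> T>(f: F) -> T {
        f()
    }
}

// Check mode: only report success or failure; `f` is never called.
struct Check;
impl Mode for Check {
    type Output<T> = ();
    fn bind<T, F: FnOnce() -> T>(_: F) {}
}

// A parser generic over the mode: in Check mode the String is never allocated.
fn digits<M: Mode>(input: &str) -> Result<(M::Output<String>, &str), ()> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(());
    }
    Ok((M::bind(|| input[..end].to_string()), &input[end..]))
}

fn main() {
    let (out, rest) = digits::<Emit>("123abc").unwrap();
    assert_eq!((out.as_str(), rest), ("123", "abc"));
    let ((), rest) = digits::<Check>("123abc").unwrap(); // no String built here
    assert_eq!(rest, "abc");
}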
While I think it's a very smart idea, the type juggling required to achieve this efficiency is non-trivial. Here's a jolly type alias from nom, illustrating what I mean:
type PResult<OM, I, O, E> = Result<(I, <<OM as OutputMode>::Output as Mode>::Output<O>), Err<E, <<OM as OutputMode>::Error as Mode>::Output<E>>>;
In practice, all this alias does is switch output and error types depending on what the output mode is. But a bunch of nested traits[10] and verbose fully-qualified path syntax make it into quite a monster. It took me writing an entire parser combinator library to understand how nom output modes worked.
When I was writing my own library, I tried to support output modes by overloading the Parser implementations. For example, I implemented Parser for char with several outputs (sketched after this list):

- char: returns itself, for use in string collector combinators.
- &str: preserves the span of the char.
- (): check mode.
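Here is a self-contained sketch of what that overloading looks like. The Error type is reduced to () and the method bodies are my reconstruction, not the library's actual code.

trait Parser<I, O, E> {
    fn parse(&self, input: I) -> Result<(O, I), E>;
}

// Output: the matched char itself.
impl<'i> Parser<&'i str, char, ()> for char {
    fn parse(&self, input: &'i str) -> Result<(char, &'i str), ()> {
        input.strip_prefix(*self).map(|rest| (*self, rest)).ok_or(())
    }
}

// Output: the sub-slice of the input covering the char, preserving its span.
impl<'i> Parser<&'i str, &'i str, ()> for char {
    fn parse(&self, input: &'i str) -> Result<(&'i str, &'i str), ()> {
        input
            .strip_prefix(*self)
            .map(|rest| (&input[..self.len_utf8()], rest))
            .ok_or(())
    }
}

// Output: nothing, for pure success/failure checks.
impl<'i> Parser<&'i str, (), ()> for char {
    fn parse(&self, input: &'i str) -> Result<((), &'i str), ()> {
        input.strip_prefix(*self).map(|rest| ((), rest)).ok_or(())
    }
}

fn main() {
    // The caller has to pin the output type down, or inference has
    // nothing to go on:
    let (span, rest): (&str, &str) = 'a'.parse("abc").unwrap();
    assert_eq!((span, rest), ("a", "bc"));
}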
This flexibility caused another issue, however.
Type inference errors
I’ve saved the most notorious problem for last. I think that everyone who has worked with Rust parser combinators is familiar with the huge error dumps. So, instead of reiterating these issues, I’ll list a couple of ways in which I tried to remedy them.
- Implementation diagnostics

The Rust compiler is really helpful, but when it comes to trait implementations, I think that it tries too hard. Concretely, when a type doesn't satisfy a trait bound (Parser<I, O, E> in my case), the compiler will suggest other types which implement that trait. All of them. Now, it does only show 5 of them, but even 5 signatures can be very big and noisy.

What this means in practice is that the compiler dumps a bunch of unimportant information instead of using the default error, which says something along the lines of: "Got Parser<&str, &str, Error> instead of Parser<&str, (), Error>".

Thankfully, Rust 1.85 brings a solution. You might've noticed it at the beginning of this post: the Parser implementation for &str was annotated with #[diagnostic::do_not_recommend]. I had to annotate all implementations with it, but in return the compiler correctly fell back on the default error, which actually named the trait of the parser you were trying to use.

- Concrete types
Due to the unnameable nature of the closures, compiler errors which feature them are not very nice. Early on I tried to do what pom does and wrapped all parsers into a Parser struct which carried an inner dyn ParserTrait box. But that had other drawbacks, and I don't remember it improving the errors.

What did help, though, was creating a new struct for each combinator factory. So delimited would return a Delimited struct, which wrapped 3 other parsers and implemented Parser itself, and so on (see the sketch below). It also had the added benefit of allowing to derive Clone on these structs. You can't clone an impl Parser type, even if the underlying type can be cloned, which caused issues[11].

The downside is that this approach required tons of boilerplate. And it causes type inference errors: because unconstrained generics aren't allowed in impls (see error E0207), some structs, like Into[12], had to carry PhantomData markers. This, ironically, would worsen the errors and also made type inference even more brittle.
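To make that concrete, here is a sketch of the struct-per-combinator approach, including the PhantomData workaround for E0207. The field names and exact bounds are my reconstruction, not the library's real definitions.

use core::marker::PhantomData;

// The foundational trait, restated so the sketch is self-contained.
pub trait Parser<I, O, E> {
    fn parse(&self, input: I) -> Result<(O, I), E>;
}

// Clone can now be derived, which an `impl Parser` return type never allows.
#[derive(Clone)]
pub struct Delimited<L, C, R, OL, OR> {
    left: L,
    content: C,
    right: R,
    // OL and OR appear only in the `where` bounds of the impl below; without
    // these phantom parameters the impl would be rejected with E0207.
    _outputs: PhantomData<(OL, OR)>,
}

impl<I, O, E, L, C, R, OL, OR> Parser<I, O, E> for Delimited<L, C, R, OL, OR>
where
    L: Parser<I, OL, E>,
    C: Parser<I, O, E>,
    R: Parser<I, OR, E>,
{
    fn parse(&self, input: I) -> Result<(O, I), E> {
        let (_, rest) = self.left.parse(input)?;
        let (out, rest) = self.content.parse(rest)?;
        let (_, rest) = self.right.parse(rest)?;
        Ok((out, rest))
    }
}

// The factory now returns a nameable (and clonable) type instead of a closure.
pub fn delimited<I, O, E, L, C, R, OL, OR>(
    left: L,
    content: C,
    right: R,
) -> Delimited<L, C, R, OL, OR>
where
    L: Parser<I, OL, E>,
    C: Parser<I, O, E>,
    R: Parser<I, OR, E>,
{
    Delimited { left, content, right, _outputs: PhantomData }
}

The extra OL and OR parameters also show the brittleness: at every call site the compiler has to deduce them from the bounds alone, which is exactly where inference starts to wobble once Parser is implemented with several output types.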
Conclusion
I picked parser combinators because I thought that they’d be easy: something to tinker on in my free time. But it turns out that efficiently implementing them is a complicated task.
I also feel like I've missed the forest for the trees. While working on my library I spent a lot of time thinking about parser combinators and less time thinking about parsing itself. The text formats I've worked with only really do two things: match some simple surrounding tokens and collect the results of other parsers into lists or maps. I think that I'd be better off spending more time optimizing prefix matching and folding, rather than working on the combinator machinery.
So, for now I'll look beyond combinators and into other recursive descent parsing strategies[13][14].
1. The PResult output type is just an alias: pub type PResult<I, O, E> = Result<(O, I), E>; the order of I and O is a bit counterintuitive, but I liked the fact that I could write let (output, rest) = p.parse(...).

2. Typically, an AST has to carry both the token and its span. My thinking was: what if instead of that we'd just have a single &str reference? This carries both the type (by comparing it with another string literal, i.e. token == "static") and the span, which can be extracted using a technique from substr_range.

3. One might question why we would ever need to configure parsers at runtime. Rust raw string literals are a real-world example. Since the code can write them with an arbitrary number of hash characters, the parser has to count their number at runtime and can only terminate the string when it finds the same number somewhere down the line.

4. I omitted the handling of Unicode escapes (\u) for clarity (and because that particular code wasn't very good).

5. Technically, I could've used caching via OnceLock, but then I'd have to create a parser registry and juggle all of the references correctly (since some factories can take lifetime-bounded arguments). I decided that this approach was too messy.

6. I believe Rust on the GNU C library has some recursion protection, but I'm on musl, so I simply get a SIGSEGV.

7. Well, that's not actually true. I had one function which used unsafe to create chars. But I was sure that it wasn't the problem. And I knew that recursion causes SIGSEGV termination, because I had hit this problem back in January with a pom-like design: a concrete Parser struct with a dyn inside.

8. The first nom commit dates back to the 27th of October, 2014.

9. To the best of my knowledge, and for Rust.

10. Note that these traits are not redundant. Error and output have to be configured separately, so there has to be a Mode implementation for both of them.

11. In practice this meant that it was impossible to reuse parser factory results, forcing me to re-create the parsers over and over, worsening the construction overhead issue. I did implement a method named clone, but it wasn't a true clone. Instead, it created a closure which took a reference to the original parser object. I think reference would've been a better name. It allowed me to reuse the parsers, but was brittle: after a single clone you could no longer move the original parser because of the created reference. And it was yet another layer of pointer indirection.

12. It was actually called coerce/Coerce, because the factory was implemented as an extension method, and into would conflict with the standard library into method.

13. My current favorite is the peg library. It stood out to me because it used proc macro generation, but allowed mutating the results using inline Rust. I think the code it generates is pretty good. And fast, too, since it reduces the input position to a single usize variable and avoids error and output objects for operations which don't need them.

14. Beyond Rust, Zig is probably a great language for making parser combinators. comptime solves the construction overhead issue (well, except recursion, but I'm not sure how Zig deals with recursive types). And there's even an active library. But I'm currently too full of parser combinators to do a deep dive.