My struggles with parser combinators
Recursive descent parsing is a popular approach to parsing. Roughly speaking, it uses functions which take input (a string or a byte stream, for example), consume a prefix of it, and return some output (a data structure or a language AST).
These functions will usually call each other: a JSON array parser needs to parse the arbitrary JSON values inside it. This gives rise to parser combinators: what if, instead of having to write all these parsing functions by hand, we had a convenient way of combining them? Then you could just call separated(parse_json_value, ","), and that'd be it for an array parser. A parser combinator library is a package which provides such combinators and, usually, commonly used parser functions, like integer parsers.
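Here is a minimal sketch of what such a combinator could look like. Everything in it is hypothetical and simplified: parsers are reduced to plain functions returning Option, and the separator is a fixed string rather than another parser.

fn separated<'s, T>(
    item: impl Fn(&str) -> Option<(T, &str)>,
    separator: &'s str,
) -> impl Fn(&str) -> Option<(Vec<T>, &str)> + 's {
    move |mut input: &str| {
        let mut items = Vec::new();
        if let Some((value, rest)) = item(input) {
            items.push(value);
            input = rest;
            // Keep consuming "separator then item" pairs for as long as we can.
            while let Some(after_sep) = input.strip_prefix(separator) {
                let (value, rest) = item(after_sep)?;
                items.push(value);
                input = rest;
            }
        }
        Some((items, input))
    }
}

// A toy item parser: a single ASCII digit.
fn digit(input: &str) -> Option<(char, &str)> {
    let c = input.chars().next().filter(char::is_ascii_digit)?;
    Some((c, &input[c.len_utf8()..]))
}

fn main() {
    let list = separated(digit, ",");
    assert_eq!(list("1,2,3]"), Some((vec!['1', '2', '3'], "]")));
}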
Over the last few months I was tinkering with a Rust parser combinator library in my time off. There were already quite a few of those, so I could rely on their design and implementations for inspiration.
Excluding a bit of a lifetime mess I got myself into during earlier iterations, the library turned out to be pretty basic. It has one foundational trait, Parser<I, O, E>, which contains one function: parse(I) -> Result<(O, I), E>. Thus, a parser is an object which takes input and returns its output and the rest of the input, excluding the prefix the parser consumed. This trait is blanket-implemented for Fn types with the same signature, which made creating parser combinators relatively easy:
pub const fn delimited<I: Copy, OL, O, OR, E>(
    left: impl Parser<I, OL, E>,
    content: impl Parser<I, O, E>,
    right: impl Parser<I, OR, E>,
) -> impl Parser<I, O, E> {
    move |input| {
        let (_, rest) = left.parse(input)?;
        let (out, rest) = content.parse(rest)?;
        let (_, rest) = right.parse(rest)?;
        Ok((out, rest))
    }
}
This function takes three parser objects and returns a closure, which applies them in order. If even a single parser fails, the error is bubbled up. Otherwise the output of the middle parser is returned.
Implementing concrete parsers was also easy:
#[diagnostic::do_not_recommend]
impl<'i> Parser<&'i str, &'i str, Error<'i>> for &str {
    fn parse(
        &self,
        input: &'i str,
    ) -> PResult<&'i str, &'i str, Error<'i>> {
        if input.starts_with(self) {
            let length = self.len();
            Ok((&input[..length], &input[length..]))
        } else if input.len() < self.len() {
            Err(Error::end(input))
        } else {
            Err(Error::unmatched(&input[..self.len()]))
        }
    }
}
This allows using a string literal as a parser. On success it returns[1] the sub-slice of the input string matching the literal, which I envisioned as a convenient way to preserve span information[2].
Additionally, since function pointers also implement the Fn trait, plain functions could be used as parsers:
pub fn line_end(input: &str) -> PResult<&str, &str, Error> {
    choice(("\n", "\r\n")).parse(input)
}
Combining all of the above we can create a simple parser:
let my_parser = delimited("Hello, ", "world", line_end);
assert_eq!(Ok(("world", "")), my_parser.parse("Hello, world\n"));
And from those simple building blocks arbitrarily complex but efficient parsers can be made. At least, that was the idea. The reality turned out to be more complicated.
Construction overhead
Let me introduce a bit of terminology and define function parsers and factory parsers. Function parsers are concrete objects on which the parse function can be called. They include the above line_end parser or string literals, since using them is equivalent to calling Parser::parse(literal, input). Function parsers are static: all of their logic is set in stone at compile time.

Factory parsers create function parsers. delimited from above is a factory. When called, it creates a closure, which can act as a function parser. Importantly, factories are dynamic: they create parsers at runtime. This means that they are more flexible, allowing configuring parsers on the fly[3]. But it also means they are less efficient, as they have to do additional work at runtime.
Here is an example of that inefficiency. Below is a parser which is supposed to parse a JSON string[4].
fn string(input: &str) -> PResult<&str, String, Error> {
    let character = choice((
        "\\\"".value('\"'),
        "\\\\".value('\\'),
        "\\/".value('/'),
        "\\b".value('\x08'),
        "\\f".value('\x0C'),
        "\\n".value('\n'),
        "\\r".value('\r'),
        "\\t".value('\t'),
        none_of_char(&['\\', '"']).map_out(|s| s.chars().next().unwrap()),
    ));
    let p = fold(character, String::new, |acc, ch| acc.push(ch));
    delimited("\"", p, "\"").coerce().parse(input)
}
string is a parser function. The problem is that it uses factories underneath. Every single time it's called, it creates 9 escape parsers (the .value method returns a closure which replaces the output of the parser with the given value) wrapped in a choice combinator, which in turn is wrapped in a delimited combinator. And all this work is done every single time we need to parse a string.
Ideally, all these combinators would be initialized once. Luckily, since Rust doesn't spill closures to the heap, most of the logic can be executed at compile time. That's why the delimited parser above was annotated const: I was trying to find a way to initialize the parsers once[5]. Unluckily, Rust closures are unnameable, so it's impossible to create a const PARSER (its type can't be named).
Now, there is a solution to this problem: make all of your parsers factories. This way the whole parser will only be initialized once, avoiding all of this overhead. Which brings us to woe two.
Recursion
A lot of grammars are recursive. This includes most programming languages and, notably, JSON, for which I was writing an example parser. So, as I was converting my parser from functions to factories, I hit a SIGSEGV[6]. Which, given that I'm writing safe[7] Rust, can only mean one thing: infinite recursion.
There are various ways of dealing with it:

- pom has call, which creates a new lazy parser: it only constructs the underlying parser when it's called. The problem is that it puts us back to square one: now each time we run a call-created parser, we have to pay the underlying factory costs.
- chumsky has recursive, which uses a OnceCell underneath to lazily create the parser; a sketch of this idea follows below.

I believe nom and winnow don't have a solution and simply eat the cost of parser construction by using function parsers.
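Here is a minimal sketch of the OnceCell trick, reduced to boxed functions over Option rather than a full parser trait. The recursive function, the cell handle, and the parentheses grammar are all illustrative; this is not chumsky's actual implementation.

use std::cell::OnceCell;
use std::rc::Rc;

// A "parser" is reduced to a boxed function for brevity.
type ParseFn = dyn Fn(&str) -> Option<&str>;

// Hands the definition closure a handle to the parser being defined,
// fills the cell exactly once, and returns a parser that defers to it.
fn recursive(
    define: impl FnOnce(Rc<OnceCell<Box<ParseFn>>>) -> Box<ParseFn>,
) -> impl Fn(&str) -> Option<&str> {
    let cell: Rc<OnceCell<Box<ParseFn>>> = Rc::new(OnceCell::new());
    let parser = define(Rc::clone(&cell));
    assert!(cell.set(parser).is_ok(), "recursive parser defined twice");
    // Note: this toy version creates an Rc cycle and leaks; a real
    // implementation would hand out a weak handle instead.
    move |input: &str| cell.get().expect("parser not initialized")(input)
}

fn main() {
    // parens = ( "(" parens ")" )? -- a self-referential grammar.
    let parens = recursive(|this| {
        Box::new(move |input: &str| match input.strip_prefix('(') {
            Some(rest) => this.get()?(rest)?.strip_prefix(')'),
            None => Some(input),
        })
    });
    assert_eq!(parens("((()))"), Some(""));
    assert_eq!(parens("(()"), None);
}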
Parsing modes
The nom maintainer, Mr Couprie, has been working on Rust parser combinators for more than 10 years now[8]. I think that's why they were the first[9] to come up with the idea of output modes for parser combinators.
Here's the pitch. Most parser combinators return some output and an error. But when we are matching scaffold tokens, like a comma (",") in JSON, we don't care about the errors or the output. We only care about the binary outcome: did the parser succeed or not? In this case the return values of the parsers are pure overhead. Vectors have to be allocated for collectors, bulky return structs have to be spilled on the stack.
Output modes solve this problem by adding a generic type to parser combinators. It defines at compile time whether the parser will return concrete output and errors or just stub them with a ().
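Here is a stripped-down sketch of the mechanism. The Mode trait with Emit and Check implementors mirrors nom 8's design in spirit, but everything here, the digits parser included, is simplified and assumed rather than copied from nom.

// A mode decides, at compile time, whether values are actually produced.
trait Mode {
    type Output<T>;
    fn bind<T, F: FnOnce() -> T>(f: F) -> Self::Output<T>;
}

// Emit mode: really build the output.
struct Emit;
impl Mode for Emit {
    type Output<T> = T;
    fn bind<T, F: FnOnce() -> T>(f: F) -> T {
        f()
    }
}

// Check mode: only report success or failure; `f` is never called.
struct Check;
impl Mode for Check {
    type Output<T> = ();
    fn bind<T, F: FnOnce() -> T>(_: F) {}
}

// A parser generic over the mode: in Check mode the String is never allocated.
fn digits<M: Mode>(input: &str) -> Result<(M::Output<String>, &str), ()> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(());
    }
    Ok((M::bind(|| input[..end].to_string()), &input[end..]))
}

fn main() {
    let (out, rest) = digits::<Emit>("123abc").unwrap();
    assert_eq!((out.as_str(), rest), ("123", "abc"));
    let ((), rest) = digits::<Check>("123abc").unwrap(); // no String built here
    assert_eq!(rest, "abc");
}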
While I think it's a very smart idea, the type juggling required to achieve this efficiency is non-trivial. Here's a jolly type alias from nom, illustrating what I mean:
type PResult<OM, I, O, E> = Result<(I, <<OM as OutputMode>::Output as Mode>::Output<O>), Err<E, <<OM as OutputMode>::Error as Mode>::Output<E>>>;
In practice, all this alias does is switch output and error types depending on what the output mode is. But a bunch of nested traits[10] and verbose fully-qualified path syntax make it into quite a monster. It took me writing an entire parser combinator library to understand how nom output modes worked.
When I was writing my own library, I tried to support output modes by overloading the Parser implementations. For example, I implemented Parser for char with several outputs (sketched after this list):

- char: returns itself, for use in string collector combinators.
- &str: preserves the span of the char.
- (): check mode.
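Here is a self-contained sketch of what that overloading looks like. The Error type is reduced to () and the method bodies are my reconstruction, not the library's actual code.

trait Parser<I, O, E> {
    fn parse(&self, input: I) -> Result<(O, I), E>;
}

// Output: the matched char itself.
impl<'i> Parser<&'i str, char, ()> for char {
    fn parse(&self, input: &'i str) -> Result<(char, &'i str), ()> {
        input.strip_prefix(*self).map(|rest| (*self, rest)).ok_or(())
    }
}

// Output: the sub-slice of the input covering the char, preserving its span.
impl<'i> Parser<&'i str, &'i str, ()> for char {
    fn parse(&self, input: &'i str) -> Result<(&'i str, &'i str), ()> {
        input
            .strip_prefix(*self)
            .map(|rest| (&input[..self.len_utf8()], rest))
            .ok_or(())
    }
}

// Output: nothing, for pure success/failure checks.
impl<'i> Parser<&'i str, (), ()> for char {
    fn parse(&self, input: &'i str) -> Result<((), &'i str), ()> {
        input.strip_prefix(*self).map(|rest| ((), rest)).ok_or(())
    }
}

fn main() {
    // The caller has to pin the output type down, or inference has
    // nothing to go on:
    let (span, rest): (&str, &str) = 'a'.parse("abc").unwrap();
    assert_eq!((span, rest), ("a", "bc"));
}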
This flexibility caused another issue, however.
Type inference errors
I’ve saved the most notorious problem for last. I think that everyone who has worked with Rust parser combinators is familiar with the huge error dumps. So, instead of reiterating these issues, I’ll list a couple of ways in which I tried to remedy them.
- Implementation diagnostics

The Rust compiler is really helpful, but when it comes to trait implementations, I think that it tries too hard. Concretely, when a type doesn't satisfy a trait bound (Parser<I, O, E> in my case), the compiler will suggest other types which implement that trait. All of them. Now, it does only show 5 of them, but even 5 signatures can be very big and noisy.

What this means in practice is that the compiler dumps a bunch of unimportant information instead of using the default error, which says something along the lines of: "Got Parser<&str, &str, Error> instead of Parser<&str, (), Error>".

Thankfully, Rust 1.85 brings a solution. You might've noticed it at the beginning of this post: the Parser implementation for &str was annotated with #[diagnostic::do_not_recommend]. I had to annotate all implementations with it, but in return the compiler correctly fell back on the default error, which actually named the trait of the parser you were trying to use.

- Concrete types
Due to the unnameable nature of the closures, compiler errors which feature them are not very nice. Early on I tried to do what pom does and wrapped all parsers into a Parser struct which carried an inner dyn ParserTrait box. But that had other drawbacks, and I don't remember it improving the errors.

What did help, though, was creating a new struct for each combinator factory. So delimited would return a Delimited struct, which wrapped 3 other parsers and implemented Parser itself, and so on (see the sketch below). It also had the added benefit of allowing to derive Clone on these structs. You can't clone an impl Parser type, even if the underlying type can be cloned, which caused issues[11].

The downside is that this approach required tons of boilerplate. And it causes type inference errors: because unconstrained generics aren't allowed in impls (see error E0207), some structs, like Into[12], had to carry PhantomData markers. This, ironically, would worsen the errors and also made type inference even more brittle.
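To make that concrete, here is a sketch of the struct-per-combinator approach, including the PhantomData workaround for E0207. The field names and exact bounds are my reconstruction, not the library's real definitions.

use core::marker::PhantomData;

// The foundational trait, restated so the sketch is self-contained.
pub trait Parser<I, O, E> {
    fn parse(&self, input: I) -> Result<(O, I), E>;
}

// Clone can now be derived, which an `impl Parser` return type never allows.
#[derive(Clone)]
pub struct Delimited<L, C, R, OL, OR> {
    left: L,
    content: C,
    right: R,
    // OL and OR appear only in the `where` bounds of the impl below; without
    // these phantom parameters the impl would be rejected with E0207.
    _outputs: PhantomData<(OL, OR)>,
}

impl<I, O, E, L, C, R, OL, OR> Parser<I, O, E> for Delimited<L, C, R, OL, OR>
where
    L: Parser<I, OL, E>,
    C: Parser<I, O, E>,
    R: Parser<I, OR, E>,
{
    fn parse(&self, input: I) -> Result<(O, I), E> {
        let (_, rest) = self.left.parse(input)?;
        let (out, rest) = self.content.parse(rest)?;
        let (_, rest) = self.right.parse(rest)?;
        Ok((out, rest))
    }
}

// The factory now returns a nameable (and clonable) type instead of a closure.
pub fn delimited<I, O, E, L, C, R, OL, OR>(
    left: L,
    content: C,
    right: R,
) -> Delimited<L, C, R, OL, OR>
where
    L: Parser<I, OL, E>,
    C: Parser<I, O, E>,
    R: Parser<I, OR, E>,
{
    Delimited { left, content, right, _outputs: PhantomData }
}

The extra OL and OR parameters also show the brittleness: at every call site the compiler has to deduce them from the bounds alone, which is exactly where inference starts to wobble once Parser is implemented with several output types.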
Conclusion
I picked parser combinators because I thought that they’d be easy: something to tinker on in my free time. But it turns out that efficiently implementing them is a complicated task.
I also feel like I've missed the forest for the trees. While working on my library I spent a lot of time thinking about parser combinators and less time thinking about parsing itself. The text formats I've worked with only really do two things: match some simple surrounding tokens and collect the results of other parsers into lists or maps. I think that I'd be better off spending more time optimizing prefix matching and folding, rather than working on the combinator machinery.
So, for now I'll look beyond combinators and into other recursive descent parsing strategies[13][14].
1. The PResult output type is just an alias: pub type PResult<I, O, E> = Result<(O, I), E>; the order of I and O is a bit counterintuitive, but I liked the fact that I could write let (output, rest) = p.parse(...).

2. Typically, an AST has to carry both the token and its span. My thinking was: what if instead of that we'd just have a single &str reference? This carries both the type (by comparing it with another string literal, i.e. token == "static") and the span, which can be extracted using a technique from substr_range.

3. One might question why we would ever need to configure parsers at runtime. Rust raw string literals are a real-world example. Since the code can write them with an arbitrary number of hash characters, the parser has to count their number at runtime and can only terminate the string when it finds the same number somewhere down the line.

4. I omitted the handling of Unicode escapes (\u) for clarity (and because that particular code wasn't very good).

5. Technically, I could've used caching via OnceLock, but then I'd have to create a parser registry and juggle all of the references correctly (since some factories can take lifetime-bounded arguments). I decided that this approach was too messy.

6. I believe Rust on the GNU C library has some recursion protection, but I'm on musl, so I simply get a SIGSEGV.

7. Well, that's not actually true. I had one function which used unsafe to create chars. But I was sure that it wasn't the problem. And I knew that recursion causes SIGSEGV termination, because I had hit this problem back in January with a pom-like design: a concrete Parser struct with a dyn inside.

8. The first nom commit dates back to the 27th of October, 2014.

9. To the best of my knowledge, and for Rust.

10. Note that these traits are not redundant. Error and output have to be configured separately, so there has to be a Mode implementation for both of them.

11. In practice this meant that it was impossible to reuse parser factory results, forcing me to re-create the parsers over and over, worsening the construction overhead issue. I did implement a method named clone, but it wasn't a true clone. Instead, it created a closure which took a reference to the original parser object. I think reference would've been a better name. It allowed me to reuse the parsers, but was brittle: after a single clone you could no longer move the original parser because of the created reference. And it was yet another layer of pointer indirection.

12. It was actually called coerce/Coerce, because the factory was implemented as an extension method, and into would conflict with the standard library into method.

13. My current favorite is the peg library. It stood out to me because it used proc macro generation, but allowed mutating the results using inline Rust. I think the code it generates is pretty good. And fast, too, since it reduces the input position to a single usize variable and avoids error and output objects for operations which don't need them.

14. Beyond Rust, Zig is probably a great language for making parser combinators. comptime solves the construction overhead issue (well, except recursion, but I'm not sure how Zig deals with recursive types). And there's even an active library. But I'm currently too full of parser combinators to do a deep dive.