Rust is forward safe

Backward compatibility is a known term in computing. I don't believe there is a universal strict definition, but I like the one given by Wikipedia because of how general it is:

A property of an operating system, software, real-world product, or technology that allows for interoperability with an older legacy system, or with input designed for such a system.

One place where backward compatibility is taken very seriously is language and compiler development. All of the biggest programming languages have a very strict backward compatibility policy. Most of them guarantee that code written decades ago can still be compiled, with Python being the only exception I know of1.

There is also another term, forward compatibility, which makes a similar guarantee in the opposite direction: a forward compatible piece of software must work with all future inputs. This is a much, much tougher guarantee to make, as it constrains design decisions and forces all current code to account for a vast number of possible scenarios and changes. This is highlighted by the fact that no mainstream programming language2 makes this promise. In fact, I only know of one in-development language which at least aspires to be forward compatible3.


With all of the above in mind, allow me to conjure up a new term: backward safety. I'll define it as a property of code which is memory safe4 when used together with older code, or with inputs designed for such code.

To give an example, a C++ module without any undefined behavior, or with some execution paths which might trigger UB but are unreachable if the right inputs are supplied, is backward safe. Such code can be called and composed from other places and, as long as its invariants are satisfied, it'll be completely memory safe. Here's a basic example:

// INVARIANT: `nodes` must contain at least one non-root node.
void prune(std::vector<int>& nodes, int root) {
    if (nodes.back() == root) {
        nodes.pop_back();
    }
    nodes.pop_back();
}

This function will pop the last element of a vector, removing the root first if it was still present. As long as the vector has non-root nodes, it's safe. But if the vector is empty, or consists of only one root node, calling prune will trigger UB. In my testing, calling it with the vector [ROOT] made the length wrap around to 2^64 - 1, causing the printing function to plow through uninitialized memory until it hit a segmentation fault.

But here, just as with compatibility, we can reverse the direction of the guarantee. Let's call code forward safe if it is memory safe when called from any place with any valid values. And again, just as with compatibility, this is a stronger guarantee which is harder to provide. Safe Rust is, notably5, forward safe.

Here's the same function rewritten in Rust:

fn prune(nodes: &mut Vec<usize>, root: usize) {
    if *nodes.last().unwrap() == root {
        nodes.pop();
    }
    nodes.pop();
}

This function is backward memory safe, same as the C++ version. But it is also forward safe, despite not performing any edge case checks. Passing invalid input to prune (such as an empty vector) will cause a panic, but it won't corrupt any objects or try to read out-of-bounds memory.
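To make both halves of that claim concrete, here's a small sketch exercising the function (repeated so the snippet is self-contained; the choice of 0 as the root value is arbitrary):

```rust
fn prune(nodes: &mut Vec<usize>, root: usize) {
    if *nodes.last().unwrap() == root {
        nodes.pop();
    }
    nodes.pop();
}

fn main() {
    const ROOT: usize = 0;

    // Valid input: the invariant holds and the last node is removed.
    let mut nodes = vec![ROOT, 1, 2];
    prune(&mut nodes, ROOT);
    assert_eq!(nodes, vec![ROOT, 1]);

    // Invalid input: an empty vector violates the invariant. Instead of
    // a length wraparound, the `unwrap` panics and nothing is corrupted.
    let result = std::panic::catch_unwind(|| prune(&mut Vec::new(), ROOT));
    assert!(result.is_err());
}
```

The panic is noisy, but crucially it is contained: no object is left in a corrupted state, and no out-of-bounds read ever happens.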

Now, one might argue that both of those functions are poorly written and if written well, they'd compile down to basically the same instructions. I agree! If written well, both functions would be performing the invariant checks themselves. But what makes Rust special here is that it protects us in two ways:

  • At compile time, by rejecting code the compiler cannot prove is always memory safe.

  • At runtime, when a mistake does slip through and such an execution path is hit, Rust will panic instead of triggering undefined behavior (for which hitting an immediate SIGSEGV would be the best-case scenario).

The former is very important, as it forces Rust programmers to be explicit about invariants. If there are non-trivial memory lifetimes shared between function arguments, they must be described. If there is a value invariant, it often has to be encoded into the type system.
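As a toy illustration of encoding a value invariant into the signature, here's one possible (hypothetical) reworking of prune, where the empty-vector case becomes impossible for the caller to ignore:

```rust
// The `?` operator bails out with `None` on an empty vector, and the
// return value tells the caller whether anything was actually pruned.
fn prune(nodes: &mut Vec<usize>, root: usize) -> Option<usize> {
    if *nodes.last()? == root {
        nodes.pop();
    }
    nodes.pop()
}

fn main() {
    let mut nodes = vec![0usize, 1, 2];
    // The edge case now lives in the type, not in a comment.
    assert_eq!(prune(&mut nodes, 0), Some(2));
    assert_eq!(prune(&mut Vec::new(), 0), None);
}
```

This is just one encoding; a non-empty vector type or a Result with a descriptive error would work equally well.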

The second point is also very helpful. Oftentimes software will have complex invariants which can't be encoded in the type system (or would be too bothersome to). Sometimes there'll be edge cases in the input the programmer has failed to consider. When an error does slip through, Rust limits the blast radius.

The way I see it, the guarantee of forward safety in Rust shifts around the complexity of writing memory safe code. In C, C++, and Zig it's easier to write brand-new code. But using it can be tough, because the caller must be aware of all the invariants the callee requires (and hope that all of them have been properly documented).

In Rust, on the other hand, writing new functions is tougher. A safe function must not violate memory safety for any possible combination of input values and any possible order of calls across any number of threads. I often see complaints online that the Rust compiler rejects valid code because it can't prove it is safe. What I think happens in a lot of those cases is that the compiler rejected code which is only safe in the context of that particular program. It must be safe under any future changes to the code! That's often a non-trivial requirement to fulfill6.

But all those requirements make composing code and refactoring much easier. I can call my own code and rely on third-party safe libraries8 without being worried that I'll cause a memory CVE by accidentally breaking an explicit or an implicit invariant.


While I was writing this blog post, a notable event happened. Cloudflare went down for 3 hours, bringing a sizable chunk of the internet with it. The error turned out to come from an unhandled panic in Rust code, prompting a lot of discussion about the design choices Rust has made. While the trade-offs of panicking are off-topic for this post9, I want to discuss why Cloudflare moved to Rust in the first place. After all, I don't believe they are the type of company to get swept up by trendy technology. And there are plenty of very experienced C++ developers at Cloudflare. Some of their core services, such as Workers, are implemented in C++. Given all of this, why would Cloudflare switch to Rust only a couple of years after it had become stable?

As it turns out, one of the core reasons for this also serves as a great real world example of what I tried to describe earlier in this post. Except this time it's not a toy function, but real-life production code which was serving billions of requests.

On the 17th of February 2017 Tavis Ormandy, a security researcher working as part of Google's Project Zero at the time, discovered that Cloudflare's proxies were dumping uninitialized memory into responses.

As is standard procedure for them, Cloudflare published a detailed postmortem, from which we can find out what happened. Back then Cloudflare was still using NGINX proxies10. They also provided several features which involved rewriting the HTML of responses, implemented as NGINX modules. All of them shared a single HTML parser written using Ragel, a state machine compiler which turns high-level regular expressions into C. Being a state machine, the resulting code made liberal use of goto. Now, that by itself wasn't a problem: code generated by Ragel is backward safe.

But it wasn't forward safe. The generated code used two pointers: p, the current position in the input, and pe, the end of the input. This introduced an implicit invariant: p must never overrun pe11. If p were, for whatever reason, to jump over pe, the state machine would trigger undefined behavior. And as it happened, Cloudflare's code using Ragel did exactly that. It's hard to put all of the blame on whoever wrote this code, though. Take a look:

script_consume_attr := ((unquoted_attr_char)* :>> (space|'/'|'>'))
>{ ddctx("script consume_attr"); }
@{ fhold; fgoto script_tag_parse; }
$lerr{ dd("script consume_attr failed");
       fgoto script_consume_attr; };

Can you spot the error? This is a Ragel file, with braces containing inline C code. The issue is that $lerr, the error branch, doesn't call fhold, the directive which prevents the character pointer from advancing. What this means is that on rare occasions, when an HTML response ended with a broken script tag such as <script type=, the parser would overrun the buffer. Now, this is a rare condition, but at Cloudflare's scale it would still get triggered from time to time. In this particular case the undefined behavior turned out to behave quite well: due to some particularities of the top-level parsing functions (one of the values in the buffers passed to the module function was set to 0), the parser would always skip the offending $lerr branch.

At least until another, more modern, stream-supporting HTML parser got added. It changed the value which previously had always been 0 to 1, allowing the buffer overflow to run rampant.

The reason I'm going into such details is to illustrate why I think panics are still better than UB. What happened here is a nightmare scenario, a perfect example of spooky action at a distance. A seemingly unrelated change triggered an error in a completely different part of the system.

The whole thing must've been a nightmare for Cloudflare, too, because they had to spend the subsequent few weeks chasing down all the web caches they could find. The bug had leaked a lot of sensitive information which passed through Cloudflare's servers, including encryption keys, cookies, and various PII. Despite this, people were still finding bits and pieces of leaked information for some time after the public disclosure.

A bug like this would've been impossible in safe Rust, because the code would've had to use a string or a byte slice, and out of bounds indexing on those triggers panics. Given this, Cloudflare probably decided that the guarantees Rust gave were worthwhile, because only a year later they would be promoting their use of Rust and actively hiring software engineers12.
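To make that last point concrete, here's a minimal sketch using a stand-in buffer (not Cloudflare's actual parser code):

```rust
fn main() {
    // A byte slice standing in for the parser's input buffer.
    let buffer: &[u8] = b"<script type=";
    let pe = buffer.len();

    // Walking up to the end of the buffer is fine.
    assert_eq!(buffer[pe - 1], b'=');

    // "Overrunning pe": indexing past the end panics with a clear
    // message instead of reading whatever bytes happen to lie beyond.
    let overrun = std::panic::catch_unwind(|| buffer[pe]);
    assert!(overrun.is_err());
}
```

The worst case here is a crashed request, not three weeks of chasing leaked secrets through web caches.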


To sum up, pretty much all systems languages can be backward safe. But when such code is written, it tends to accumulate invariants, and breaking those can have dire consequences. This means responsibility gets shifted from the callee to the caller. Rust is forward safe: it puts all of the responsibility on the callee. Every Rust function must be memory safe regardless of which inputs it gets or which thread it is called from. This makes it somewhat tougher to write individual functions, but easier to compose functions which have already been written.

Now, it is absolutely possible to write memory safe code in C and C++ in a backward safe manner. But doing so leaves it vulnerable to future changes or missed invariant violations. Rust is one of the very few forward safe systems programming languages13. That's why I pick it over C and C++: I find it easier to compose complex systems and refactor in Rust, even if writing individual pieces can be tougher.


Thanks to Alisa Sireneva for taking the time to review this post and providing important corrections and insight.


  1. Aside from the Python 2 to 3 transition, which has now arguably been "decades ago" (Python 3 is about 7 years older than Rust 1.0 and 6 years older than Swift 1.0), Python is the only major language I know of that deprecates standard library modules. PEP 387 outlines Python's backward compatibility policy, which allows the maintainers to remove modules 5 years after they have been declared deprecated.

  2. That I know of.

  3. I'm talking about Hare. It's very basic and very opinionated, and also aims to be a "100-year programming language". It is also still in development and hasn't reached 1.0, which is when these guarantees are supposed to kick in.

  4. Note that everything below will talk about memory safety, which is chiefly about 3 things:

    • Elimination of double-free errors.

    • No use-after-free or reads/writes from/to dangling pointers.

    • No data races.

    Safe Rust doesn't mean that the logic is correct or that an app can't be exploited. One can write a tool with SQL or shell injections in perfectly safe Rust. One can even write a program which will format their hard drive in perfectly safe Rust. So, for the rest of the post, when I say "safety", I mean memory safety.

  5. I agree with a common viewpoint that memory safety is one of the main reasons to use Rust and not just some subset of C++. Of course, ergonomics, ecosystem, and a powerful type system are all important. But I feel that a lot of those came about either to support memory safety or after Rust went mainstream (with a lot of promotion I remember being centered on it being a solution to memory safety bugs).

  6. This is also why I agree with the claim that unsafe Rust is harder to write. In C and C++ one might write a function and say that it only works for such and such inputs. A safe Rust function which uses unsafe must uphold memory safety for any valid input and must often7 be thread-safe to boot.

  7. It was pointed out to me that, since Rust provides us with !Sync types, which cannot be shared across threads, a function taking those doesn't have to concern itself with thread safety. Importantly, this is enforced by the compiler. Which means one can opt-out of making a function thread safe without being worried that some downstream caller will break this invariant.

  8. I think that's also one of the reasons people use external crates so much in Rust. It's easier to rely on code someone else wrote when they must fulfill a number of guarantees the compiler will check for you.

  9. In fact, my opinion here is pretty extreme. I believe that unwinding panics shouldn't have been a part of Rust in the first place: I'm not a fan of exceptions. But that's a pretty heterodox opinion. And it also isn't practically useful, since at this point panics are here to stay.

  10. They have since migrated to Pingora, a framework for building proxies/servers written in Rust.

  11. This is a very curious case. Cloudflare's postmortem pointed to this check as the culprit:

    if ( ++p == pe )
        goto _test_eof;
    

    So, Cloudflare's post said that had this check been >=, the error would've been caught. But Ms. Sireneva pointed out that this is not necessarily the case! As per the C++ standard, section 7.6.6 [expr.add]:

    When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.

    • If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.

    • Otherwise, if P points to a (possibly-hypothetical) array element i of an array object x with n elements ([dcl.array]), the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) array element i+j of x if 0≤i+j≤n and the expression P - J points to the (possibly-hypothetical) array element i−j of x if 0≤i−j≤n.

    • Otherwise, the behavior is undefined.

    So, if I understand this correctly, doing a >= comparison on an out of bounds p is also UB, meaning the compiler would theoretically have the right to optimize this >= comparison to ==. I'm not sure if real world compilers actually do that, but it is a possibility.

    Rust fixes this issue by using fat pointers, which store the length as an integer. When we index a slice, Rust first executes a >= comparison on two integers (index and length), which is well-defined. And only if the index is less than the length is it added to the slice pointer, ensuring that the resulting pointer is valid.
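A rough model of that ordering, as a hypothetical helper mirroring the behavior of slice access (not the actual standard library source):

```rust
// The length travels with the pointer, so the bounds check is a plain
// integer comparison; an out-of-bounds pointer is never even formed.
fn checked_get(slice: &[u8], index: usize) -> Option<u8> {
    if index < slice.len() {
        // The offset is only computed once we know it lands in bounds.
        Some(unsafe { *slice.as_ptr().add(index) })
    } else {
        None
    }
}

fn main() {
    let data = [10u8, 20, 30];
    assert_eq!(checked_get(&data, 2), Some(30));
    assert_eq!(checked_get(&data, 3), None);
}
```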

  12. This is based on a short talk given at 2018 Bay Area Rust Meetup. The slide listing the reasons for choosing Rust (4:30) says somewhat coyly: "Safe (we had a bug once...)".

  13. The only other forward safe systems language used in foundational projects that I know of is KaRaMeL: an F* dialect which compiles to C. Furthermore, since it's written in F*, it can be verified, ensuring logical correctness. There's also verification of existing systems, but I don't count it because, from what I've read, maintaining both the code and the proofs is tough and expensive.