Performance pitfall in a tight loop

Over the past few days I've been optimizing a handful of functions in my project which become bottlenecks at a large scale. One of them was made up of a simple loop:

fn calculate(
    intervals: &Intervals,
    tree: &Tree,
    population_size_at: impl Fn(f64) -> f64,
    integral: impl Fn(f64, f64) -> f64,
) -> f64 {
    let mut out = 0.0; // log-likelihood
    let mut last_height = intervals.state[0].1;
    let mut num_lineages = 1usize;

    for (node, height) in intervals.state.iter().skip(1) {
        let binomial = (num_lineages * (num_lineages - 1) / 2) as f64;
        let area = integral(last_height, *height);
        out -= binomial * area;

        if tree.is_internal(*node) {
            // merge event
            let pop = population_size_at(*height);
            out -= pop.ln();
            num_lineages -= 1;
        } else {
            // the node is a leaf, increase the number of
            // linages
            num_lineages += 1;
        }

        last_height = *height;
    }

    out
}

integral and population_size_at are small closures which get inlined, so this function gets compiled to a tight loop. For a small number of intervals (under 1 thousand) calculate barely shows up on flamegraphs, but when there are 80 thousand elements¹ the it starts taking up around 240μs per call, which adds up to 4 minutes of wall time per 1 million calls. This is a problem because at such scales one million steps² take around 35 minutes to complete, of which 4 more minutes is spent on other bookkeeping functions and the rest is taken up by a very thoroughly optimized algorithm which runs on the GPU.

Thus, calculate was a pretty glaring bottleneck. It's hard to parallelise because num_lineages creates a dependency between all elements. I could track num_lineages as a third element in intervals, but that'd make the function which incrementally updates the intervals much more expensive³. After a cursory look I decided that calculate was pretty much unoptimizable and focused on other expensive functions.

Until one evening I did a double take and dug into what calculate was actually doing. I profiled the program and took a look at the perf report. This was the event count of the compiled loop:

        │1c0:┌─→movsd    QWORD PTR [rsp+0x8],xmm3
   0.54 │    │  add      rcx,0x10
   3.35 │    │  cmp      r13,rcx
   0.02 │    │↑ je       152
        │1cf:│  movapd   xmm2,xmm1
        │    │let binomial = (num_lineages * (num_lineages - 1) / 2) as f64;
   0.20 │    │  lea      rdx,[rax-0x1]
        │    │let area = integral(last_height, *height);
        │    │  movsd    xmm1,QWORD PTR [r12+rcx*1+0x28]
        │    │|start, end| (end - start) / pop_size,
   0.18 │    │  movapd   xmm3,xmm1
   4.12 │    │  subsd    xmm3,xmm2
   0.11 │    │  divsd    xmm3,QWORD PTR [rsp+0x18]
        │    │let binomial = (num_lineages * (num_lineages - 1) / 2) as f64;
   2.49 │    │  mov      rsi,rdx
   4.81 │    │  imul     rsi,rax
        │    │  shr      rsi,1
   0.07 │    │  xorps    xmm2,xmm2
        │    │  cvtsi2sd xmm2,rsi
        │    │out -= binomial * area;
   4.68 │    │  mulsd    xmm2,xmm3
   0.53 │    │  movsd    xmm3,QWORD PTR [rsp+0x8]
  51.59 │    │  subsd    xmm3,xmm2
        │    │if tree.is_internal(*node) {
  19.36 │    │  inc      rax
        │    │  cmp      QWORD PTR [r12+rcx*1+0x20],rbp
   0.00 │    │  cmova    rax,rdx
        │    └──jbe      1c0

In hindsight, it's pretty obvious what's wrong here. But at that time I focused on the cvtsi2sd operand — "Convert Doubleword Integer to Scalar Double Precision Floating-Point Value". After a quick search I got the impression that the instruction it was encoding had a latency of 12 cycles⁴. So, I changed the type of num_lineages from usize to f64 and to my amazement the function became twice as fast.

On the next day I decided to sit down and write a post about how a single innocuous instruction could slow down a loop so much. But I couldn't find anything about problems with integer to floating point conversion online, even though it seemed like this should be a common issue. The closest was this Stack Overflow post with an unrolled loop which used cvtsi2sd becoming slower. Turns out the cvt family of instructions has an issue with false dependencies. An xmm register can hold two floating point values and cvtsi2sd xmm3,rsi only overwrites the lower, leaving the higher one untouched⁵. This means that subsequent instructions which use xmm3 might depend on its value before cvtsi2sd was called, reducing the opportunities for out-of-order execution⁶. But that wasn't the case in my code. As can be seen in the assembly above, compilers will insert a zeroing xorps xmm3,xmm3 instruction, which breaks this false dependency.

Somewhat puzzled, I went back to my code, switched num_lineages back to usize, put an inline(never) on calculate to make it easier to review, and profiled it again. Imagine my surprise when I saw that I saw the function run at completely the same speed as it did with the floating point num_lineages. For a moment I thought that I might've imagined all of the performance gains I saw the day before. Baffled, I checked out the previous commit. The function started running twice as slow again, even though its contents were absolutely identical to what I had with my previous edits. Here's the perf report:

   1.12 │ a0:┌─→add      rcx,0x10
   1.32 │    │  cmp      r15,rcx
        │    │↑ je       6f
        │ a9:│  movapd   xmm3,xmm2
        │    │let binomial = (num_lineages * (num_lineages - 1) / 2) as f64;
   6.90 │    │  lea      rdx,[rax-0x1]
        │    │let area = integral(last_height, *height);
   0.24 │    │  movsd    xmm2,QWORD PTR [r12+rcx*1+0x28]
        │    │|start, end| (end - start) / pop_size,
   0.58 │    │  movapd   xmm4,xmm2
   0.96 │    │  subsd    xmm4,xmm3
   9.13 │    │  divsd    xmm4,xmm5
        │    │let binomial = (num_lineages * (num_lineages - 1) / 2) as f64;
   8.30 │    │  mov      rsi,rdx
   2.04 │    │  imul     rsi,rax
   0.12 │    │  shr      rsi,1
   8.56 │    │  xorps    xmm3,xmm3
   0.28 │    │  cvtsi2sd xmm3,rsi
        │    │out -= binomial * area;
   2.29 │    │  mulsd    xmm3,xmm4
  14.63 │    │  subsd    xmm1,xmm3
        │    │if tree.is_internal(*node) {
  16.73 │    │  inc      rax
   1.64 │    │  cmp      QWORD PTR [r12+rcx*1+0x20],r14
   0.12 │    │  cmova    rax,rdx
   0.61 │    │↑ jbe      a0
        │    │  subsd    xmm1,xmm0
  24.42 │    └──jmp      a0

And, for comparison, the new version which used an f64 num_lineages:

        │ d0:┌─→add    rax,0x10
   1.89 │    │  movapd xmm2,xmm7
   2.59 │    │  cmp    r15,rax
   0.63 │    │↑ je     79
        │ dd:│  movapd xmm7,xmm1
        │    │let binomial = num_lineages * (num_lineages - 1.0) / 2.0;
   1.08 │    │  movapd xmm8,xmm3
   0.91 │    │  addsd  xmm8,xmm4
        │    │let area = integral(last_height, *height);
   9.39 │    │  movsd  xmm1,QWORD PTR [r12+rax*1+0x28]
        │    │|start, end| (end - start) / pop_size,
   2.48 │    │  movapd xmm9,xmm1
   3.79 │    │  subsd  xmm9,xmm7
   4.54 │    │  divsd  xmm9,xmm10
        │    │let binomial = num_lineages * (num_lineages - 1.0) / 2.0;
   9.58 │    │  movapd xmm7,xmm3
   2.84 │    │  mulsd  xmm7,xmm8
   2.27 │    │  mulsd  xmm7,xmm5
        │    │out -= binomial * area;
   2.80 │    │  mulsd  xmm7,xmm9
  27.49 │    │  addsd  xmm7,xmm2
        │    │if tree.is_internal(*node) {
   5.50 │    │  addsd  xmm3,xmm6
   1.21 │    │  cmp    QWORD PTR [r12+rax*1+0x20],r14
   0.91 │    │↑ jbe    d0
        │    │  subsd  xmm7,xmm0
  20.09 │    │  movapd xmm3,xmm8
        │    └──jmp    d0

With these 3 reports side by side it should be painfully obvious what's wrong, but it still took me some time to realize what was wrong. The original version was spilling the out accumulator on the stack!

I hyper-focused on cvtsi2sd, but it wasn't the problem. In fact, it couldn't have.

        │  cvtsi2sd xmm2,rsi
   4.68 │  mulsd    xmm2,xmm3
   0.53 │  movsd    xmm3,QWORD PTR [rsp+0x8]
  51.59 │  subsd    xmm3,xmm2

The instruction which kept stalling here was subsd, which depended on xmm2 (binomial) and xmm3. But xmm2 was also the destination register of mulsd, and if cvtsi2sd was causing this, the stall should've happened at mulsd. Instead the culprit was obviously movsd, which was reading from memory.

I wasn't able to figure out what exactly was causing the spill. The parent functions of calculate simply load a few floating point values from behind mutexes. This pathological behavior only happens when the integer to float conversion is present. Thus, in a roundabout way, cvtsi2sd might really be the issue because it somehow disturbs the heuristics LLVM uses to spill the registers.

Moral of the story: reading assembly is an important skill one shouldn't neglect⁷.

Intervals are nodes of a binary tree sorted by their height/date. ↩
It's an Markov chain Monte Carlo simulation and calculate is called once per step. ↩
And the intervals have to be incrementally updated, because they have to be sorted, and sorting 80 thousand elements takes up quite a bit of time. In fact, even with partial updates, the function doing those is one of the rest of the "slow" functions I was trying to optimize. ↩
It turns out that I was looking at the data for a very old generation of CPUs. On Zen 4/5 and Ice Lake the latency for this instruction is around 6-7 cycles, according to this source. ↩
https://www.felixcloutier.com/x86/cvtsi2sd ↩
Stack Overflow: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?. ↩
An earlier version of the moral blamed my ills on memory accesses, but after some self-reflection I realized the main issue here was my mistake in interpreting the perf annotation which happened because I wasn't comfortable enough with assembly.
Also, in this particular case memory as it is commonly understood wasn't even the main problem. The intervals access is a linear scan, which cannot be avoided and is about as good as it gets when it comes to reads. There's also the tree node state fetch, but here the problem seems to be mostly about its randomness, which can't be fixed by prefetching when updating intervals. ↩