A Performance Investigation Challenge

I really liked matklad’s Performance Visualization challenge (partially because it didn’t take me long to find the line with samply, which made me feel good).

Here’s a skills gap, or perhaps a research question: how do you identify an impactful but diffuse problem? I have a concrete example in mind.

So, nine months ago, trying to optimize a function in Speedometer 3, my colleague Iain Ireland dug through the generated assembly, and largely walked away uncertain as to what was going on.

Fast forward to one month ago: he landed a weirdly impactful patch. He removed an ‘optimization’ in which we wrote the tag separately from the value wherever the Value representation let us do that. This change alone improved some benchmarks by 6-8%. The best hypothesis here is that on some x86 hardware this split write totally broke store forwarding, and that the failed forwarding badly hurt performance.

Now: once we had the hypothesis that this was a store forwarding problem, we were able to show, using perf and hardware performance counters, that the patch reduced the number of failed store forwards.

The research / methodological question I pose here, however, is: how on earth do you find these sorts of problems without luckily ending up staring at them? One has to imagine there are other issues hanging out there, but I really have no idea how to find them.

Now, I have an unread copy of Systems Performance by Brendan Gregg sitting on my desk, and maybe the answer is in there, but I’m curious whether anyone has techniques or methodologies that have worked well for them, or whether this is a research area that still needs work (Automated Mechanical Sympathy, anyone?).

Post-Publication Postscript:

Iain writes:

To be clear, I think we're relatively confident we understand the problem in retrospect. It's the load-store conflict problem described here: https://zeux.io/2025/05/03/load-store-conflicts/

In particular, if you search for "Indeed, if we check the Zen 4 optimization guide, we will see (emphasis mine)", there's a pull quote that says "The LS unit supports store-to-load forwarding (STLF) when there is an older store that contains all of the load’s bytes"

Which is precisely the thing we did not do. So I don't even think this is our "best hypothesis", I think it's just the answer.
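
For readers who want the rule Iain quotes in a more concrete form, here is a toy model of it in Python. It is only a sketch under loud assumptions: the layout (tag in the upper 32 bits, payload in the lower 32 bits), the tag constant, and the helper names `write_split`, `write_whole`, and `can_forward` are all made up for illustration and are not SpiderMonkey's actual encoding or code. It also models nothing about real hardware beyond the one sentence "an older store that contains all of the load's bytes".

```python
import struct

# Hypothetical tag value, chosen for illustration only.
TAG_INT32 = 0xFFF88001


def write_split(buf, offset, tag, payload):
    """Write payload and tag as two separate 4-byte stores (the old 'optimization')."""
    struct.pack_into("<I", buf, offset, payload)   # low 4 bytes
    struct.pack_into("<I", buf, offset + 4, tag)   # high 4 bytes
    return [(offset, 4), (offset + 4, 4)]          # stores issued, in program order


def write_whole(buf, offset, tag, payload):
    """Write the whole 8-byte value with a single store."""
    struct.pack_into("<Q", buf, offset, (tag << 32) | payload)
    return [(offset, 8)]


def can_forward(stores, load):
    """Toy version of the quoted rule: the youngest older store that overlaps
    the load must contain all of the load's bytes. Returns False when no
    store overlaps at all (nothing to forward from)."""
    load_off, load_size = load
    for off, size in reversed(stores):
        overlaps = off < load_off + load_size and load_off < off + size
        if overlaps:
            return off <= load_off and load_off + load_size <= off + size
    return False


split_buf, whole_buf = bytearray(8), bytearray(8)
split_stores = write_split(split_buf, 0, TAG_INT32, 42)
whole_stores = write_whole(whole_buf, 0, TAG_INT32, 42)

same_bytes = split_buf == whole_buf                     # memory ends up identical
split_forwards = can_forward(split_stores, (0, 8))      # 8-byte load after split write
whole_forwards = can_forward(whole_stores, (0, 8))      # 8-byte load after single write
print(same_bytes, split_forwards, whole_forwards)
```

Both writes leave the same eight bytes in memory, which is why the split write looks harmless in the source; the difference only shows up when a full-width load follows closely behind and no single older store covers it.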