Non-Traditional Profiling

Also known as “you can just put whatever you want in a jitdump you know?”

When you profile JIT code, you have to tell a profiler what on earth is going on in those JIT bytes you wrote out. Otherwise the profiler will shrug and just give you some addresses.

There’s a decent and fairly common format called jitdump, which originated in perf but has since been adopted elsewhere. The basic thrust of the parts we care about is: you associate names with ranges of addresses.

Of course, the basic range you’d expect to name is “function foo() was compiled to bytes 0x1000-0x1400”.
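If you haven’t seen the format before: a jitdump file is a small header followed by a stream of records, and the record we care about is the “code load” one. Here’s a very rough sketch of its shape (field names paraphrased and simplified; go read the perf jitdump specification for the real byte layout):

// Very rough sketch of a jitdump "code load" record: a name attached to a
// range of code bytes. Fields paraphrased from the perf jitdump spec; this
// is illustrative, not a byte-exact definition.
#include <cstdint>
#include <string>

struct JitCodeLoadRecord {
  uint32_t id;         // record type: code load
  uint32_t totalSize;  // size of this record in bytes
  uint64_t timestamp;

  uint64_t codeAddr;   // where the JIT put the code, e.g. 0x1000
  uint64_t codeSize;   // how many bytes were emitted, e.g. 0x400
  std::string name;    // "foo"... or whatever string you feel like writing
  // ...followed by the raw code bytes in the real format.
};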

Suppose you get that working. You might get a profile that looks like this one.

This profile is pretty useful: you can see from the flame chart which execution tier created the code being executed, and you can see code from inline caches, etc.

Before I left for Christmas break though, I had a thought: to a first approximation, both optimized and baseline code generation are fairly ‘template’ style. That is to say, we emit (relatively) stable chunks of code for each of our bytecodes, in the case of our baseline compiler, or for each of our intermediate-representation nodes in the case of Ion, our top-tier compiler.
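To make ‘template style’ concrete, here’s a deliberately hypothetical sketch (none of these names are SpiderMonkey’s actual API) of what that kind of compiler loop looks like; the point is that each op has its own fixed emitter, so the code it produces is stable:

// Hypothetical sketch of template-style code generation: every bytecode op
// maps to a fixed emitter, so the emitted chunks are (relatively) stable.
// These names are made up for illustration, not SpiderMonkey's real API.
#include <vector>

enum class Op { GetProp, Add };

struct MacroAssembler { /* emits machine code; stubbed out here */ };

void emitGetPropTemplate(MacroAssembler&) { /* same chunk every GetProp */ }
void emitAddTemplate(MacroAssembler&) { /* same chunk every Add */ }

void compileBaseline(MacroAssembler& masm, const std::vector<Op>& bytecode) {
  for (Op op : bytecode) {
    switch (op) {
      case Op::GetProp: emitGetPropTemplate(masm); break;
      case Op::Add:     emitAddTemplate(masm);     break;
    }
  }
}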

What if we looked more closely at that?

Some of our code is already tagged with AutoCreatedBy, an RAII class which pushes a creator string when it is constructed and pops it off when it goes out of scope. I went through and added AutoCreatedBy to each of the LIR ops’ codegen methods (e.g. CodeGenerator::visit*). Then I rigged up our jitdump support so that instead of dumping functions, we dump the function name plus the whole chain of AutoCreatedBy as the ‘function name’ for the sequence of instructions generated while that AutoCreatedBy was live.
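If you haven’t run into the pattern, here’s a minimal sketch of what an RAII creator tag like AutoCreatedBy looks like, plus the name you’d write into the jitdump record (my paraphrase, not the actual SpiderMonkey code):

// Minimal sketch of an RAII "created by" tag, and the frame name built from
// the current chain of tags. Paraphrased for illustration; not the actual
// SpiderMonkey implementation.
#include <string>
#include <vector>

struct CreatorStack {
  std::vector<std::string> creators;  // who is currently emitting code
};

class AutoCreatedBy {
  CreatorStack& stack_;
 public:
  AutoCreatedBy(CreatorStack& stack, std::string who) : stack_(stack) {
    stack_.creators.push_back(std::move(who));  // push on construction...
  }
  ~AutoCreatedBy() { stack_.creators.pop_back(); }  // ...pop on destruction
};

// The string used as the jitdump 'function name' for instructions emitted
// while a given chain of AutoCreatedBy was live: function name + whole chain.
std::string frameName(const std::string& function, const CreatorStack& stack) {
  std::string name = function;
  for (const std::string& c : stack.creators) {
    name += "/" + c;
  }
  return name;
}

// Each LIR op's codegen method then pushes its own tag (e.g. "visitHasShape"),
// so instructions emitted there get named "<function>/.../visitHasShape".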

That gets us this profile:

While it doesn’t look that different, the key is in how the frames are named. Of course, the vast majority of frames are just the name of the call instruction... that only makes sense. However, you can see some interesting things if you invert the call tree.

For example, for a single self-hosted function we spend 1.9% of the profiled time in the code generated by ‘visitHasShape’, which is basically:

// Load the object's shape into the output register...
masm.loadObjShapeUnsafe(obj, output);
// ...then set output to whether it equals the expected shape.
masm.cmpPtrSet(Assembler::Equal, output, ImmGCPtr(ins->mir()->shape()),
               output);

Which is not particularly complicated.

Ok so that proves out the value. What if we just say... hmmm. I actually want to aggregate across all compilations; ignore the function name, just tell me the compilation path here.
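Concretely, and still paraphrasing rather than showing the real code, that just means leaving the function name out of the string handed to the jitdump record, so identical code generation paths collapse into one frame no matter which function they were compiled for:

// Aggregated variant of the earlier sketch: drop the function name and keep
// only the chain of creators.
#include <string>
#include <vector>

std::string aggregatedFrameName(const std::vector<std::string>& creators) {
  std::string name;
  for (const std::string& c : creators) {
    if (!name.empty()) name += "/";
    name += c;
  }
  return name;
}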

Woah. Ok, now we’ve got something quite different, if really hard to interpret.

Even more interesting (easier to interpret) is the inverted call tree:

So across the whole program, we’re spending basically 5% of the time doing guardShape. I think that’s a super interesting slicing of the data.

Is it actionable? I don’t know yet. I haven’t really opened any bugs on this; a lot of the highlighted code is stuff where it’s not clear there is a faster way to do what’s being done, outside of engine architectural innovation.

The reason to write this blog post is basically to share that... man, we can slice and dice our programs in so many interesting ways. I’m sure there’s more to think of. For example, one experiment not shown here: I added AutoCreatedBy inside a single set of macro-assembler methods (around barriers) to see whether I could actually make the GC barrier cost visible (it’s low on the benchmarks I checked, yo).

So yeah. You can just... put stuff in your JIT dump file.

Edited to Add: I should mention this code is nowhere. Given I don’t entirely know how actionable this ends up being, and the code quality is subpar, I haven’t even pushed this code. Think of this as an inspiration, not a feature announcement.