An Inline Cache isn’t Just a Cache

If you read the Wikipedia page for Inline Caching, you might think that inline caches are caches in the same way that you might talk about a processor cache, or a page cache for a webserver like memcached. The funny thing is that this really undersells how they're used in SpiderMonkey (and, I believe, other JS engines), and in hindsight I really wish I had known more about them years ago.

The Wikipedia page cites a paper, L. Peter Deutsch and Allan M. Schiffman, "Efficient implementation of the Smalltalk-80 system" (POPL '84), which I found on the internet. In the paper, the authors describe the key aspect of their implementation as the ability to dynamically change the representation of program code and data,

"as needed for efficient use at any moment. An important special case of this idea is > caching> : One can think of information in a cache as a different represenation of the same information (considering contents and accessing information together)"

Part of their system solution is a JIT compiler. Another part is what they call inline caching. As they describe it in the paper, an inline cache is self-modifying code for method dispatch. Call sites start out pointing to a method-lookup routine: the first time the lookup is invoked, the returned method (along with a guard on the receiver's type, to ensure the call remains valid) overwrites the method-lookup call. The hypothesis here is that a particular call site will very often resolve to the same method, despite in principle being able to resolve to any method.
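To make that concrete, here's a toy model of that scheme in JavaScript (all names are mine, invented for illustration — real implementations patch machine code, not closures). The call site starts as a slow lookup, and the first call replaces it with a guarded fast path for the observed class:

```javascript
// Slow path: full dynamic dispatch, plus self-modification of the call site.
function slowLookup(receiver, selector) {
  const proto = Object.getPrototypeOf(receiver);
  const method = proto[selector]; // the expensive method lookup

  // "Overwrite the call site": install a guarded direct call.
  callSite = (recv) => {
    if (Object.getPrototypeOf(recv) === proto) {
      return method.call(recv);        // guard passed: skip the lookup
    }
    return slowLookup(recv, selector); // guard failed: fall back and repatch
  };

  return method.call(receiver);
}

// The call site begins in the "unlinked" state, pointing at the lookup.
let callSite = (recv) => slowLookup(recv, "greet");

class Cat { greet() { return "meow"; } }
class Dog { greet() { return "woof"; } }

callSite(new Cat()); // miss: performs the lookup, patches callSite
callSite(new Cat()); // hit: guard passes, calls Cat's greet directly
callSite(new Dog()); // guard fails: relookup, callSite repatched for Dog
```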

In my mind, the pattern of local self-modifying code, the locality hypothesis, as well as the notion of a guard are the fundamental aspects of inline caching.

The paper hints at something bigger, however. On page 300 (PDF page 4):

"For a few special selectors like +, the translator generates inline code for the common case along with the standard send code. For example, + generates a class check to verify that both arguments are small integers, native code for the integer addition, and an overflow check on the result. If any of the checks fail, the send code is executed."

It's not clear if the authors considered this part of the inline caching strategy. However, it turns out that this paragraph describes the fundamental elements of the inline caching inside SpiderMonkey.

When SpiderMonkey encounters an operation that can be performed efficiently under certain assumptions, it emits an inline cache for that operation: an out-of-line jump to a linked list of 'stubs'. Each stub is a small piece of executable code, usually consisting of a series of sanity-checking guards followed by the desired operation. If a guard fails, the stub jumps either to the next stub, generated for a different case (the stubs are arranged in a linked list), or to the fallback path, which does the computation in the VM and then possibly attaches a new stub for the heretofore unobserved case. When the inline cache is initialized, it starts out pointing to the fallback case (this is the 'unlinked' state from the Smalltalk paper).

SpiderMonkey generates these inline caches for all kinds of operations: Property accesses, arithmetic operations, conversion-to-bool, method calls and more.

Let's make this a little more concrete with an example. Consider addition in JavaScript:

function add(a, b) { return a + b; }

The language specifies an algorithm for figuring out what the correct result is based on the types of the arguments. Of course, we don't want to have to run the whole algorithm every time, so the first time the addition is encountered, we will attempt to attach an inline cache matching the input types (following the locality hypothesis) at this particular operation site.
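As a rough sketch of why that full algorithm is worth avoiding, here is a drastically simplified caricature of the spec's addition semantics (ignoring objects, BigInt, and the ToPrimitive machinery entirely): the result depends on the runtime types of both operands, so the generic path must inspect them every time.

```javascript
// A simplified sketch of generic JS addition: dispatch on operand types.
function genericAdd(a, b) {
  if (typeof a === "string" || typeof b === "string") {
    return String(a) + String(b); // string concatenation
  }
  return Number(a) + Number(b);   // numeric addition
}

genericAdd(3, 5);        // 8
genericAdd("he", "llo"); // "hello"
genericAdd(1, "2");      // "12"
```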

So let's say you have an add of two integers: add(3, 5). The first time through will be an inline cache miss, because no stubs have been generated yet. At this point, SpiderMonkey will attach an Int32+Int32 stub, which consists of generated code like the following pseudocode:

int32_lhs = unbox_int32(lhs, fail); // jump to fail if lhs is not an Int32
int32_rhs = unbox_int32(rhs, fail); // jump to fail if rhs is not an Int32
res = int32_lhs + int32_rhs;
if (res.overflowed()) goto fail;    // the result must also fit in an Int32
return res;

fail:
  goto next_stub_in_chain;          // next stub, or the fallback if none remain

Any subsequent pair of integers being added (add(3247, 12), etc.) will hit in this cache and return the right thing (overflow aside). Of course, this cache won't work in the case of add("he", "llo"), so on a subsequent miss we'll attach a String+String stub. As different types flow through the IC, we build up a chain (*) handling all the types observed, up to a limit: to save memory, we terminate a chain once it grows too long to provide any value. The chaining here is the 'self-modifying' part of inline caching, though in SpiderMonkey it's not actually the executable code that is modified, just the control flow through the stubs.
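The whole shape of the mechanism — guarded stubs, a fallback that computes the result in the VM and attaches a new stub, and a length limit on the chain — can be sketched in JavaScript. This is a toy model with invented names (MAX_STUBS, cachedAdd, etc.), not SpiderMonkey code, and it models the stub chain as an array of guard/op pairs rather than patched jumps:

```javascript
const MAX_STUBS = 6; // assumed limit for illustration; the real limit differs
let stubs = [];      // the stub chain, in attach order

function isInt32(v) {
  return typeof v === "number" && (v | 0) === v;
}

// Fallback path: do the generic computation, then maybe attach a new stub
// specialized for the type combination we just observed.
function fallback(a, b) {
  const result = a + b; // the VM's generic semantics

  if (stubs.length < MAX_STUBS) {
    if (isInt32(a) && isInt32(b)) {
      stubs.push({
        // Guards: both operands are Int32, and the sum doesn't overflow.
        guard: (x, y) => isInt32(x) && isInt32(y) && isInt32(x + y),
        op: (x, y) => x + y,
      });
    } else if (typeof a === "string" && typeof b === "string") {
      stubs.push({
        guard: (x, y) => typeof x === "string" && typeof y === "string",
        op: (x, y) => x + y, // string concatenation
      });
    }
  }
  return result;
}

// The inline cache: walk the stub chain; on a miss, take the fallback.
function cachedAdd(a, b) {
  for (const stub of stubs) {
    if (stub.guard(a, b)) return stub.op(a, b);
  }
  return fallback(a, b);
}

cachedAdd(3, 5);        // miss: attaches the Int32+Int32 stub
cachedAdd(3247, 12);    // hit in the Int32+Int32 stub
cachedAdd("he", "llo"); // miss: attaches the String+String stub
cachedAdd("a", "b");    // hit in the String+String stub
```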

There have been a number of different designs for inline caching in SpiderMonkey, and I've been working on converting some of the older designs to our current state of the art, CacheIR, which abstracts the details of the ICs so they can be shared between the different compilers within the SpiderMonkey engine. Jan's blog post introducing CacheIR has a more detailed look at what CacheIR looks like.

So, in conclusion, inline caches are more than just caches (outside the slightly odd Deutsch-Schiffman definition). The part of me that likes to understand where ideas come from would be interested to know how inline caching evolved from its humble beginnings in 1983 to the more general system I describe above in SpiderMonkey. I'm not very well read on inline caching, but I can put an upper bound on it: the 2004 paper "LIL: An Architecture Neutral Language for Virtual-Machine Stubs" describes inline cache stubs of similar generality and complexity, suggesting the evolution happened somewhere between 1983 and 2004.

(*) For the purposes of this blog post, we treat the IC chains as simple linked lists; the reality is more complicated, in order to handle the type inference system.

Thank you so much to Ted Campbell and Jan de Mooij for taking a look at a draft of this post.

Addendum: Tom Schuster points out, and he's totally right, that the above isn't entirely clear: this isn't the -most- optimized case. IonMonkey will typically fall back to inline caches only where it can't do something better.