Benchmarking is Hard, and Modern Hardware is Making it Harder

Let's say you have a program, and you'd like to find out if you can make it faster by changing the compiler (or a library, or a setting; I'm a compiler guy, so I'll talk about compilers from here on in).

This doesn't seem like too hard a problem, does it? It sure is easy to imagine a simple method:

  1. Run the program with the old, baseline version of the compiler.
  2. Run the program with the new, modified version of the compiler.
  3. Compare the times.
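Mechanically, that simple method is just two timed runs and a comparison. A minimal sketch in Python (the workload here is a stand-in; in practice it would be the binaries produced by the baseline and the modified compiler):

```python
import subprocess
import sys
import time

def time_once(cmd):
    """Run a command once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Stand-in workload; swap in the real binaries built by each compiler.
workload = [sys.executable, "-c", "sum(range(10**6))"]

baseline = time_once(workload)  # step 1: old compiler's output
modified = time_once(workload)  # step 2: new compiler's output
print(f"baseline: {baseline:.3f}s  modified: {modified:.3f}s")  # step 3
```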

This makes sense, doesn't it? Time to publish!

Unfortunately for your nascent publication, you start thinking: What happens if I run the programs again?

If the year is 1977, and you're running the program with one thread, on a machine with no preemption, with an in-order processor that has no cache (say, an Apple II), you'll probably get the same time. Perhaps even down to the cycle.

So you run your programs again: Lo and behold the times have changed, both before and after.

Now what can you say about the effect of your change to the compiler? What happened!?

The problem

It's not 1977.

Let's talk about where variance can come from when you're benchmarking today. This list gets longer every day. Remember, this is all your computer trying to be helpful, improving overall system performance and user experience.

  • You're likely running your benchmark on a machine with a preemptively multitasking operating system, with virtual memory, file systems, daemons, etc.

    The process your benchmark runs in can be preempted in favour of another, or not. Context switches incur overhead, and so how often your benchmark process is switched out will have an effect on the final runtime.

  • Your machine is chock-a-block with caches from top to bottom (L1, L2, L3, (L4), disk cache, page cache). Whether these caches are warm or cold for the workload will change the result.
  • Your benchmark likely runs with multiple threads, which means synchronization and thread scheduling come into play, with the order of interleavings affecting the benchmark's outcome.

At this point you're probably already despairing: "How is it possible to say anything about computer performance?"

Oh, but wait, the problem gets worse.

  • What if your benchmark is written for one of these new-fangled JIT compiled languages? The longer your program is running, the faster it gets as more pieces are compiled, first into native code, then into better native code.

Now let's make the problem even worse!

  • What if your test machine is a relatively recent Intel machine? Then your performance will depend on how warm your machine is, because of a feature called Turbo Boost, which lets these processors overclock themselves so long as they remain within their electrical and thermal limits.

    This means that by simply putting your machine by a cold window (hello Canadian Winter!) you might get better results!

Modern Benchmarking

So this is obviously concerning. What do we do about this?

A big part of the answer comes from acting like scientists, and using the scientific method.

In this case, we need to use controlled experimentation, which means controlling for as many variables as possible. You want your experiment to be as much as possible about the effect you are trying to measure, and you want it to be replicable.

  • Try to make the machine you use to test as quiet as possible. Your goal is to eliminate as many sources of variance as possible from the runtime results you'll collect. Disable any GUI if you can, disable any daemons that aren't required, kick everyone else off the machine, etc.
  • Disable any frequency scaling. As I've discovered, sometimes you have to look deep. For example, on Linux, the CPU frequency scaling system is separate from the Intel Turbo Boost control. Disable both! (Here's a pointer for earlier-than-3.9 series kernels.)
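On Linux machines using the intel_pstate driver, the Turbo Boost state is exposed through sysfs. Here's a small best-effort probe to check it (the file only exists on machines using that driver, so elsewhere it reports nothing):

```python
from pathlib import Path

# sysfs knob exposed by the Linux intel_pstate driver:
# "1" means Turbo Boost is disabled, "0" means it's enabled.
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")

def turbo_disabled():
    """Return True/False for the Turbo Boost state, or None if the
    intel_pstate knob isn't present on this machine."""
    if not NO_TURBO.exists():
        return None
    return NO_TURBO.read_text().strip() == "1"

print(f"turbo disabled: {turbo_disabled()}")
```

Writing `1` to that file (as root) disables Turbo Boost until the next boot.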

Unfortunately, even on a quiet machine, you have relatively little control over things like caches and synchronization effects. This means that we need another tool: statistics.
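Collecting samples is the first step. A minimal sketch that times a function over multiple runs, discarding warm-up iterations to keep one-time effects like cold caches and JIT compilation out of the samples (the workload is a stand-in):

```python
import statistics
import time

def measure(fn, warmup=5, runs=30):
    """Time `fn` over `runs` iterations, after `warmup` untimed ones.

    Discarding warm-up iterations helps exclude one-time effects
    (cold caches, JIT compilation) from the samples you keep.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

# Stand-in workload; in practice this would invoke the program under test.
samples = measure(lambda: sum(range(100_000)))
print(f"median: {statistics.median(samples) * 1e6:.1f} us over {len(samples)} runs")
```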

To show anything, you're going to have to quantify and account for run-to-run variance across multiple runs. To show an improvement, a statistical hypothesis test should be used. If that's too hardcore, at least quantifying the variance will go a long way toward showing your result is not a fluke.
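As one concrete option, a permutation test makes no distributional assumptions and is easy to write from scratch: shuffle the pooled samples many times and see how often a random split produces a mean difference as large as the one you observed. A sketch with made-up timing data:

```python
import random
import statistics

def permutation_test(a, b, trials=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value: how often a random relabelling of
    the pooled samples yields a mean difference at least as large as
    the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(perm_a) - statistics.mean(perm_b)) >= observed:
            extreme += 1
    return extreme / trials

# Hypothetical run times (seconds) under the old and new compiler.
old = [10.2, 10.4, 10.1, 10.5, 10.3, 10.2, 10.4, 10.3]
new = [9.8, 9.9, 9.7, 10.0, 9.9, 9.8, 10.1, 9.9]
p = permutation_test(old, new)
print(f"p-value ~ {p:.4f}")
```

A small p-value means the speedup is unlikely to be run-to-run noise.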

I haven't even started to discuss how to combine (ppt) results when you have multiple programs to test.
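The usual starting point, though, is the geometric mean of per-benchmark speedups, since it weights relative improvements and regressions symmetrically. A sketch with made-up numbers:

```python
import math

def geometric_mean(speedups):
    """Geometric mean of per-benchmark speedups (old_time / new_time).

    Preferred over the arithmetic mean for ratios: a 2x speedup and a
    2x slowdown cancel out to exactly 1.0.
    """
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical speedups on four benchmarks.
speedups = [1.10, 0.95, 1.30, 1.02]
print(f"overall speedup: {geometric_mean(speedups):.3f}x")
```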


The status quo in evaluation is changing, and this is a great thing; bad science can lead us down dark alleys and dead ends, which wastes time.

Unfortunately, the difficulty of doing this right means that you're going to see bad benchmarking all the time. We need to try not to be disheartened, and instead simply try to be better!

Things to look forward to:

  • Virtual machine images as a basis for evaluation. In some ways this forms the backbone of the ultimate form of replicability: dump the source code, the compiler and build scripts used to compile it, all the libraries (dynamic and static), and all the test scripts into a virtual machine, and distribute THAT as the artifact of your evaluation.

A terribly interesting paper on behavioural science provides a reminder that experimentation is hard, and not just in benchmarking.

(Update: January 19, 2014: My supervisor reminds me of a couple things I forgot to make clear:

  • Reproducibility requires documentation: Experiments need to have all their conditions documented. Remind readers of the conditions whenever you discuss the results of these experiments.

  • One of the best ways to ensure reproducibility is to have a testing harness which tracks the experimental conditions and manages the results.)


More reading

  • Why You Should Care about Quantile Regression: This paper actually inspired this post, along with trying to create a satisfactory experimental method for my thesis.
  • Producing wrong data without doing anything obviously wrong!: An in-depth discussion of measurement bias. I'm not entirely sold on their particular solution, but this is nevertheless a useful discussion point.
  • The Evaluate Collaboratory is working on this problem from the academic end.

    Let's build an "Experimental Evaluation in Software and Systems Canon", a list of readings on experimental evaluation and "good science" that have influenced us and that have the potential to influence the researchers coming after us.

    The canon they have proposed is rich with further reading.