July 2, 2009

Feel the cache size: a definitive experiment

Long gone the wonderful era of 640 kilobytes of memory and 8 to 12 MHz of clock speed, which "ought to be enough for anybody". Though Mr.Gates never said that, I still think that's a fair amount of memory and good speed for a lot of things. I wouldn't have said that if I hadn't worked in a fully-blown windowed desktop publishing system called Quark XPress on a Macintosh LC, which had, I think, some 2 MB of RAM and ran at 16MHz, back in 1991. Time has passed, the download size of the latest Quark XPress is about half a gig now (as opposed to the prehistoric version that fit a few diskettes at the time), we have gigabytes of memory and we run at staggering gigahertz speeds, and yet the same program with basically same functionality (oh, okay, plus-minus) is slower on a system that's a thousand times more powerful! Or take Acrobat Reader that has grown 15 times in size, has become damning slow and resource-hungry across versions 3.0 through 9.0 while doing same old thing - rendering PostScript programs.

Shouldn't 2 gigabytes/2 gigahertz really be enough for almost anything we can think of today, except maybe some heavy real-world 3D simulations, including games? If you are not going to calculate the state of all atoms in the Solar system in real time, 2/2 ought to be enough for you. Um, if you write your code carefully and thoughtfully. I don't think binaries I produced ever reached even a few megabytes in size, although there were some fairly complicated projects I worked on with hundreds of thousands of LOC. Hundreds of thousands of lines, mind you, usually means a gigantic project on the brink of unmaintainability.

Unfortunately the reason bloatware exists is not only because of the fallacy "don't optimize your code, because Mr.Moore is going to do that for you soon". One other reason that becomes apparent if you look at bloatware closely, is that large part of it is commercial software. Commercial vendors want you to feel comfortable about the money you pay. Because otherwise, just picture for a moment that you paid $500 of your hard-earned money for something 500 KB in size that downloads faster than you can blink. Doesn't this picture feel somewhat wrong?

I can't think of some bosses, of course, sitting there at Adobe, Microsoft and Sun, forcing employees to deliver bigger binaries. The mechanism is a bit more subtle: it is, I think, a whole culture of languages and platforms that unobtrusively force developers to produce monster code. Yes, that's Java and C# with their verbosity coupled with everything-is-an-object philosophy, as well as the C++ world, where nothing practically stops you from reckless use of templates - possibly the bloater number one in the neighbourhood.

I'm just wondering when (and if) natural selection can finally come into play in the software industry and wipe out the dinosaur herd in favour of smaller, quicker and smarter species. There are some barely noticable signs this is happening now (startups), but clearly, it's too early to trumpet Nature's victory.

***

But enough of emotions. I'm going to carry out an experiment to prove that we still live in the world of kilo and mega rather than giga, at least memory-wise. Why? The CPU cache.

It shouldn't come as a surprise to you that there's CPU cache in your machine: usually a very fast memory chip from kilobytes to megabytes in size, designed to make memory operations faster. Unfortunately it comes at a price: the CPU cache is organized in small chunks called cache lines, typically 64 bytes in size, that are read from and written to the main memory at once, as a whole. Because every cache miss triggers a memory read of a whole 64-byte chunk, while all the processor requested was, say, 4 bytes, it means in certain circumstances caching makes things worse than if there was no cache at all.

So what I'm going to do is, build a program that can do something - actually same thing at every run, using code segments of varying sizes. We will change the code block from a few bytes to tens of megabytes and measure the difference in running time. Would it be fair? I think so, yes. We are not going to interpret the results though, as in how and why, but rather how much, that is, this is going to be a purely quantitative experiment.

The idea is this: I generate a C program with exactly one million functions, each returning some number:
int f0() { return 1; }
int f1() { return 2; }
int f2() { return 3; }
int f3() { return 4; }
int f4() { return 5; }
...

and so on, to one million. Each function translates to 11 bytes of code:
_f0:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
addl $1, %eax
leave
ret

I also generate a table of pointers to these functions so that we can pick them randomly in a tight loop.

Poor, poor GCC, it took almost 20 minutes to compile the file, even though I turned off optimizations. Anyway. The main module looks like this:
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

#include "tab.h"

#define PRIME 99971
#define LOOP 1000000000


void run(int max)
{
time_t begin = time(NULL);
srand(PRIME);

int i;
for (i = 0; i < LOOP; i++)
func_table[rand() % max]();

int diff = (int)(time(NULL) - begin);
printf("code size: %ld time: %d secs\n",
func_table[max] - func_table[0], diff);
}


int main()
{
printf("Running...\n");
run(1);
run(10);
run(100);
run(1000);
run(10000);
run(100000);
run(1000000);
return 0;
}

As you can see, at every run we do equivalent job (same number of iterations, identical functions called), except bigger and bigger chunks of the code segment are involved. The results? Voila:
code size: 11  time: 12 secs
code size: 110 time: 22 secs
code size: 1100 time: 29 secs
code size: 110000 time: 36 secs
code size: 1100000 time: 45 secs
code size: 11000000 time: 174 secs

Notice the spike on the last line? That's a transition from 1MB to 10MB, whcih crosses the actual size of my L2 cache I have on my machine, and it's 2MB.

Anyway. The conclusion is, performing the same job using different binary sizes can be up to 15 times slower. That's it, and I'll leave you there deciding whether you should write simple, succinct, beautiful and fast code, or you should impress the rest of the world with "methodologies" you use and the monstrosity of your project. Sometimes size translates to money, true, but the opposite is your satisfaction with the job done. Haven't you decided yet?

Update: A lot of people pointed out that the code used in the experiment does not represent typical real-world situations. This is true, but let's say the impact in your case is 20% rather than 15 times. If it's a virtual machine for your language, for example, then optimization would be something you'd struggle for, as 20% for a runtime environment is a huge gain.

8 comments:

Jengu said...

With templates you only pay for what you actually instantiate. If you're only paying for what you instantiate, you're paying the same cost you would pay writing non-templated code anyway. So I'll keep my fancy policy based programming and let you keep wasting your time debugging copy and paste errors.

hm said...

Jengu, in duck-typed dynamic languages you achieve basically same results as with templates in C++, and yet there are no templates in Python, Ruby or PHP, and no copy/paste. Templates are not needed there. There is a small performace overhead when using dynamic variables, as opposed to potentially huge size overhead with C++ templates.

I was thinking, by the way, about a series of articles with some nice template-free C++ containers and variants.

nolem said...

One point is also by writing "simpler" code (for example doing a function call instead of relying on templates) is that the behavior is more predictable. With templates the compiler is likely to instantiate it inline and thus cause code bloat which results in cache misses and that's what the article is all about anyway

Greg Herlein said...

What's in tab.h? I suspect that's where you define func_table. Won't compile without it.

hm said...

Hi Greg,

Here is tab.h:

typedef int (*func)();
#define TAB_MAX 1000001
extern func func_table[TAB_MAX];

You will need to generate the functions for your tab.c (the pattern is shown in the article) and also the pointer table:

func func_table[TAB_MAX] = {
f0,
f1,
f2,
f3,
...

Then compile tab.c, main.c with NO optimization (i.e. no -O switches of any level) and link them.

Bernd Eckenfels said...

this does not show that large binaries are the problem, in fact it shows that large binaries do not affect the performance as long as you run computations with a small instruction cache footprint.

Paul Keeble said...

Caches are about common access patterns. Its one of those classic 80/20 rules that 80% of the programs runtime is running the same 20% of the code (or maybe its a 90/10 rule).

Caches don't work when that rule isn't true, and in very complicated programs you are right that it might be costing some amount of processing time, but the figures here are the absolute worst it could be. Optimising for locality of reference however has improved performance of programs dramatically for 2 decades and developers do know this principle exists!

I suspect the most common cause of the feeling of slowness on modern PCs is actually due to hard drives. While CPUs have been increasing in performance by a factor of at least 60 since 1996, drives have gone from about 5MB/s to 100MB/s, and the access times haven't dropped much at all (about 15ms to about 11ms for desktop drives, mainly because the normal desktop now uses 7200rpm not 5400rpm). Big binaries take longer to load, and big files take longer to read and write if they have gotten bigger faster than the performance of the hard drive. CPUs are sat idle all over the world working on word processors, web browsers etc waiting on the user and on the hard drive. SSDs will help but they won't solve the increasing distance from CPU performance to storage.

The final part of the puzzle is that CPU speed has also zoomed away from memory clock speeds. With the 486 we had equal(ish) clock speeds, but memory has been stuck at around 200Mhz for a long time. Sure we have worked out how to improve the bandwidth by reading more data in a block with DDR to keep the programs running, but that only works if locality of reference is how the program accesses memory.

The moral of this rambling is that random access has increased in relative cost, both in terms of memory and storage - simply because random access has not improved in performance in either case since the mid 90's. While sustained read/write has increased more dramatically its not done so at the same rate as CPU performance. Bloat (aka as features) when scaling with CPU speeds is disproportionate hit when those other components are used. Cache is about the only thing holding everything to a decent level of performance.

Thankfully most programs, even big commercial ones, are highly relative in access patterns and do have decent locality of reference and small core algorithms. Not to mention the fact that they tend to iterate through memory in nice linear chunks for good reason.

hm said...

Paul, I agree the bottleneck for most applications is the bus, the HD, etc, but still there are cases with massive code base (virtual machines for one thing) where any part can be executed in a seemingly random manner. A lot of programmers think code size can be sacrificed for speed gains, which may be simply not true if your code is comparable with the cache size.