July 22, 2009

Unsigned int considered harmful

Something always intuitively stopped me from using (or rather overusing) unsigned ints in my programs, unless, of course, I was forced to by some library or an API - an area where you can see all kinds of unsigned handles and descriptors, memory size types, etc. One day though I decided to finally try them out and in my next big project, I started declaring everything that should seemingly hold only positive values as unsigned int.

I got caught very soon like a child, by a very simple mistake. Let's say I have an array of something, pointers maybe, and in one place I iterate over that array to initialize some objects, and then in some other place I iterate over it backwards to call finalization methods. Array indexes in C/C++ should always be positive, so I wrote:

for (unsigned i = 0; i < count; i++)
array[i]->initialize();

// ...

for (unsigned i = count - 1; i >= 0; i--)
array[i]->finalize();

Oops. A bug. Unsigned ints can never be less than zero, so i >= 0 actually means after the last iteration my poor i will jump to the other end of the world trying to read the 4,294,967,295th element of my array, and will of course fall like a hero. Not to mention that an empty array with count=0 will result in a similar disaster. The shortest correct way of doing backward iteration is:

for (unsigned i = count; i--; )
array[i]->finalize();

This, however, doesn't look terribly elegant, and it is, let's say, not very educational either. I mean, would you show this as the only method of iterating backwards in C to you grandchildren when you retire? No.

The philosophical aspect of this is, when you are very close to the "edge", i.e. the area of zero, and that's where the numbers we deal with in our programs mostly reside, signed ints are safe to cross the edge without catastrophic consequences, because a simple comparison with zero just works in the "natural" way. Something like i >= 0 has been working for centures, since the invention of algebra.

Ok, but that's not all.

In the world of descriptors, handles and memory sizes there is always a special value for no-descriptor, no-handle, or no-location, and that's a typecast'ed -1. For example, there is std::string::npos, which is, I suppose, a cool C++ way of saying just -1, or INVALID_HANDLE_VALUE, which is, in turn, a cool Microsoft way. Apart from that you always have to use a macro or a constant, or a typecast'ed -1 if you are brave enough, there is actually one quite troublesome issue.

Consider you have some unsigned type ufoo_t defined many headers away, and there is also ubar_t defined even farther away, and you kind of have an idea that they bear the same meaning. While juggling with your handles/descriptors or memory sizes you mix values of both types.

Now, one day you decide to compile your project on a 64-bit platform for the first time, and you discover that your invalid memory location (or a handle - whatever it is) is not -1 but 0x00000000FFFFFFFF. Why? Because ufoo_t happens to be declared as unsigned long, whereas ubar_t is unsigned int, and on your 64-bit platform they have different sizes. And because the sign bit is not extended for unsigned ints, it is lost in translation and here you go, your -1 becomes something much less sensible.

I think entities that in some exceptional cases can hold indication of an error are much safer and simpler to declare as signed ints in terms of kicking the value around. One of the most important functions in UNIX, open(), returns -1 on error and there is a whole universe of happily working programs that deal with signed file descriptors and compare the return value with just -1.

So what is it, really, that makes unsigned ints better in any way than signed ints? The upper bounds? I bet you don't even remember them precisely, but I'll remind you that we are talking about 2,147,483,647 vs. 4,294,967,295. What task is it that should be solved on a 32-bit machine and needs numbers exactly between 2 and 4 billion? Because if you deal with numbers below 2b, you are fine with signed ints, and if they are greater than 4b, that's a 64-bit story. In general, if you need billions you simply go for 64-bit to be safe, don't you?

There is one disadvantage of using signed ints that careful programmers would point out, of course, and that's when you check your array index for validity. The condition is:

if (index < 0 || index >= count)
error();

or one comparison more compared to the case with an unsigned index:

if (index >= count)
error();

This is true, but I wouldn't worry about this at all, because index < 0 usually translates to 2 CPU instructions: something like CMP and JMP, where the jump target is just a few instructions away, which means with some optimization techniques available on many modern processors the overhead can be considered in infinite proximity of 0.

All in all, my experiment with unsigned ints was so negative that I'm seriously considering switching back to the old good int, it's just I guess I'm in trouble, because changing my typedef will immediately break a lot of things. I doubt even some Brute Force Programming techniques will help.

Upd: There is always a little dirty trick to make the verification above shorter with signed int, in case you care about microseconds:

if (unsigned(index) >= count)
error();

Updupd: A redditor mentioned Google C++ Style Guide says the same on unsigned ints: Integer types (expand the section to see details).

July 21, 2009

Brute Force Programming

Brute force programming is something many of us use in our everyday job anyway, but I don't think it has ever been formalized as a methodology or even given a name. Brute force programming has nothing to do with tree search or iterating over all possibilities. It is called so because the BFP tricks are brutal, dirty, may be creative, they are not documented to the best of my knowledge and generally not accepted as a standard practice in any solid organization. And BFP is mainly applicable with early-binding languages like C/C++, Pascal, assembly language, possibly also C# and Java.

So the formal definition I propose is: Brute force programming is finding creative ways of having the compiler reveal potential problems in the code before running it, if running and testing is less trivial than compiling, as is usually the case with big projects.

Let's take a look at some examples - just some examples that are in no way comprehensive, to give you an idea. After all, BFP is about creativity, and the possibilities are potentially endless here. Now, picture this - and you will find it very easy to do it - you work on a huge project which is on the brink of total unmaintainability, if not crossed the line already. Piles of code written by different people with different styles, approaches, understanding of what programming is about, from different epochs. You need to fix bugs and add features - that's your job, but the whole project is so unreliable that almost every change breaks something.

Method #1. Rename early, rename often: this is the simplest method in BFP and possibly the least creative. Let's say you slightly change implementation of some function or semantics of some variable, but it is used in so many places that there is a risk of introducing a bug. And you want to catch as many problems as possible before running your monster application. Grepping the sources is not reliable and also in case of a generic name like, for example, "close", you will get millions of matches and you will certainly miss something important.

The quickest way is to rename your object - function, variable, class, whatever it is, and watch compilation errors. Ideally, you need a good IDE to take you to the exact places where the compiler complains about unknown identifiers. Don't worry about the new name - you can rename it back later, once you make sure that all uses of the object are correct.

This may be especially useful when making subtle changes in function signatures, i.e. argument and return types. In C++ there are some strange type compatibility rules, for example booleans are compatible with pointers, so if you change one argument of your function from, say, const char* to bool, your code will compile happily, and some old parts of code will manage to send a string to your poor boolean argument. The solution is: just don't rely on the compiler and rename it brutally.

Method #2. Comment out if in doubt: in C++ there are situations when you can't be sure your function is actually called. The most famous problem is overridden virtual methods: one mistake in the name or in the signature of you overridden method - even the slightest incompatibility in arguments, and by the rules of function overloading in C++ you have a shiny new overloaded function which has nothing to do with the base virtual one. And it will never be called as a virtual method. GCC can give warnings in such cases if enabled, but it doesn't seem to be doing the job well.

The quickest (and the dirtiest, like everything else in BFP) is to: (1) make sure there is no "virtual" keyword in front of your method and (2) comment out the definition of your function - the body, that is. If your method prototype is correct, the linker will give some obscure error about the virtual method table - VMT. No, some compilers usually give a very good self-descriptive message saying that there is an unresolved function pointer in the VMT of your class, but at least GCC is not among such nice compilers.

So, comment out if in doubt.

Method #3. Declare but don't implement: my favorite. Very much like method #2, except you put a prototype of something that's not supposed to be called at all.

Example 1: let's say you have a group of overloaded functions that provide functionality for some data types but not the others. It may be, for example, that you provide overloads for bool, all kinds of ints (and there are plenty of them in C/C++) but not pointers. And you don't want someone to call your boolean version with a string. What you do is you declare overloads with a few different kinds of pointers but don't implement them.

Example 2. This is actually a well known trick with copy constructors and assignment operators: if copying of your class is non-trivial or may have some unwanted side effects, or you simply don't want to implement it, declare a private copy constructor and an assignment operator, and again, don't implement them. In most cases the compiler will complain even before the linking stage because of inaccessibility of these methods. I noticed though in a lot of places that only the copy constructor is masquaraded this way; beware that the same applies to the assignment operator, too: it can be called implicitly by some non-trivial and non-obvious C++ non-code. Ah, sorry, code.

Method #4. Hide and conquer: if you think your class exposes too much to the outside world, start hiding declarations one by one in the protected and even the private section, and as usual watch the compiler dancing around all uses of your declarations that disappeared from the sight. If you are lucky enough, it will turn out nobody uses some of the entities and so you will simply leave them where they belong - in the private/protected sections. This is always good.

Likewise, try to change the inheritance clause of your class from public to protected (i.e. class X: protected Y {...) and see what happens. If nothing actually happens, it means you have just conquered a piece of good land.

This method, thus, basically helps to minimize incorrect usage of your interfaces.

Method #5. The wall of werrors: this is actually the most canonical BFP method. If you are a GCC user, then wherever you see g++ or gcc invoked, add the -Wall -Werror options. In the C++ world warnings are called warnings by mistake: they are all potential problems and you have to fix them all, one by one. To make sure no "warning" slips away, you have to stop compilation on any "warning", hence -Werror. Exceptions can be made for some political kind of "warnings" that MSVC may give, for example, "function x is obsolete" in relation to some 40 years old UNIX function. We understand that Microsoft wants to make UNIX obsolete, but that's only in their dreams. These warnings can be disabled individually with "#pragma warning".

Did I miss something? Of course I did. In BFP there are a lot of one-off tricks that you invent on the fly just to catch something before running your application. Maybe you need to document what you did then, and share with us.

July 2, 2009

Feel the cache size: a definitive experiment

Long gone the wonderful era of 640 kilobytes of memory and 8 to 12 MHz of clock speed, which "ought to be enough for anybody". Though Mr.Gates never said that, I still think that's a fair amount of memory and good speed for a lot of things. I wouldn't have said that if I hadn't worked in a fully-blown windowed desktop publishing system called Quark XPress on a Macintosh LC, which had, I think, some 2 MB of RAM and ran at 16MHz, back in 1991. Time has passed, the download size of the latest Quark XPress is about half a gig now (as opposed to the prehistoric version that fit a few diskettes at the time), we have gigabytes of memory and we run at staggering gigahertz speeds, and yet the same program with basically same functionality (oh, okay, plus-minus) is slower on a system that's a thousand times more powerful! Or take Acrobat Reader that has grown 15 times in size, has become damning slow and resource-hungry across versions 3.0 through 9.0 while doing same old thing - rendering PostScript programs.

Shouldn't 2 gigabytes/2 gigahertz really be enough for almost anything we can think of today, except maybe some heavy real-world 3D simulations, including games? If you are not going to calculate the state of all atoms in the Solar system in real time, 2/2 ought to be enough for you. Um, if you write your code carefully and thoughtfully. I don't think binaries I produced ever reached even a few megabytes in size, although there were some fairly complicated projects I worked on with hundreds of thousands of LOC. Hundreds of thousands of lines, mind you, usually means a gigantic project on the brink of unmaintainability.

Unfortunately the reason bloatware exists is not only because of the fallacy "don't optimize your code, because Mr.Moore is going to do that for you soon". One other reason that becomes apparent if you look at bloatware closely, is that large part of it is commercial software. Commercial vendors want you to feel comfortable about the money you pay. Because otherwise, just picture for a moment that you paid $500 of your hard-earned money for something 500 KB in size that downloads faster than you can blink. Doesn't this picture feel somewhat wrong?

I can't think of some bosses, of course, sitting there at Adobe, Microsoft and Sun, forcing employees to deliver bigger binaries. The mechanism is a bit more subtle: it is, I think, a whole culture of languages and platforms that unobtrusively force developers to produce monster code. Yes, that's Java and C# with their verbosity coupled with everything-is-an-object philosophy, as well as the C++ world, where nothing practically stops you from reckless use of templates - possibly the bloater number one in the neighbourhood.

I'm just wondering when (and if) natural selection can finally come into play in the software industry and wipe out the dinosaur herd in favour of smaller, quicker and smarter species. There are some barely noticable signs this is happening now (startups), but clearly, it's too early to trumpet Nature's victory.

***

But enough of emotions. I'm going to carry out an experiment to prove that we still live in the world of kilo and mega rather than giga, at least memory-wise. Why? The CPU cache.

It shouldn't come as a surprise to you that there's CPU cache in your machine: usually a very fast memory chip from kilobytes to megabytes in size, designed to make memory operations faster. Unfortunately it comes at a price: the CPU cache is organized in small chunks called cache lines, typically 64 bytes in size, that are read from and written to the main memory at once, as a whole. Because every cache miss triggers a memory read of a whole 64-byte chunk, while all the processor requested was, say, 4 bytes, it means in certain circumstances caching makes things worse than if there was no cache at all.

So what I'm going to do is, build a program that can do something - actually same thing at every run, using code segments of varying sizes. We will change the code block from a few bytes to tens of megabytes and measure the difference in running time. Would it be fair? I think so, yes. We are not going to interpret the results though, as in how and why, but rather how much, that is, this is going to be a purely quantitative experiment.

The idea is this: I generate a C program with exactly one million functions, each returning some number:
int f0() { return 1; }
int f1() { return 2; }
int f2() { return 3; }
int f3() { return 4; }
int f4() { return 5; }
...

and so on, to one million. Each function translates to 11 bytes of code:
_f0:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
addl $1, %eax
leave
ret

I also generate a table of pointers to these functions so that we can pick them randomly in a tight loop.

Poor, poor GCC, it took almost 20 minutes to compile the file, even though I turned off optimizations. Anyway. The main module looks like this:
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

#include "tab.h"

#define PRIME 99971
#define LOOP 1000000000


void run(int max)
{
time_t begin = time(NULL);
srand(PRIME);

int i;
for (i = 0; i < LOOP; i++)
func_table[rand() % max]();

int diff = (int)(time(NULL) - begin);
printf("code size: %ld time: %d secs\n",
func_table[max] - func_table[0], diff);
}


int main()
{
printf("Running...\n");
run(1);
run(10);
run(100);
run(1000);
run(10000);
run(100000);
run(1000000);
return 0;
}

As you can see, at every run we do equivalent job (same number of iterations, identical functions called), except bigger and bigger chunks of the code segment are involved. The results? Voila:
code size: 11  time: 12 secs
code size: 110 time: 22 secs
code size: 1100 time: 29 secs
code size: 110000 time: 36 secs
code size: 1100000 time: 45 secs
code size: 11000000 time: 174 secs

Notice the spike on the last line? That's a transition from 1MB to 10MB, whcih crosses the actual size of my L2 cache I have on my machine, and it's 2MB.

Anyway. The conclusion is, performing the same job using different binary sizes can be up to 15 times slower. That's it, and I'll leave you there deciding whether you should write simple, succinct, beautiful and fast code, or you should impress the rest of the world with "methodologies" you use and the monstrosity of your project. Sometimes size translates to money, true, but the opposite is your satisfaction with the job done. Haven't you decided yet?

Update: A lot of people pointed out that the code used in the experiment does not represent typical real-world situations. This is true, but let's say the impact in your case is 20% rather than 15 times. If it's a virtual machine for your language, for example, then optimization would be something you'd struggle for, as 20% for a runtime environment is a huge gain.

July 1, 2009

SQL: time to retire?

SQL is quite possibly the oldest language in wide use today. It begs for changes or even complete replacement, but the entire area of databases somehow remains very uninviting for innovation and paradigm shifts. Why? Good question, I don't know, and I don't think I care either...


Oh, and maybe that's the answer?

With the exception of a small group of highly skilled engineers who develop DBMS systems (and then that's not exactly database programming), the rest of the field is relatively low-skill, low-barrier, mostly non-engineering niche that doesn't look very attractive to, let's say, more or less accomplished, skilled engineers. I mean, of course you can use databases in your projects, but that's not your main expertise, neither you want it to be such, right? (Or if it is, I'm afraid this blog post is not for you).

Anyway... A few thoughts on SQL.

I'm sure many programmers would admit that they never liked SQL from the aesthetical point of view but they'd at least acknowledge its expressiveness and conciseness. I have great doubts about both.

Let's say you have a programming language that can handle persistent data, i.e. you declare variables and mark them as persistent, which means values assigned to these variables survive program reboots. We are interested in one particular case: a persistent array of structures (objects). That's a database table, essentially. Let's say the language is something Python-ish and let's try to express this queer construct:
SELECT customers.name, countries.name FROM customers
LEFT JOIN countries ON customers.country_id=countries.id
WHERE customers.id=2
in our shiny new magical language:
c = customers[2]
print c.name, countries[c.country_id]
Isn't it neat? I think the compiler can easily figure out what's required here and translate this piece to low-level database operations involving indexes if needed. Moreover, the entire epic of ID's and relations can be eliminated by some automatic object reference infrastructure that performs translations of run-time references to database ID's and back. Having that in mind, can we, as a slightly more complicated example, select and print multiple rows? Of course we can:
for c in customers:
if not c.suspended:
print c.name, c.country.name
Would you rather prefer the awkward, limited and unaesthetic SQL after seeing this?

We'll probably talk more about this idea some other time, but meanwhile I'd suggest a little fix to the existing SELECT statement. I think a small syntactic change like this:
SELECT FROM table GET column1, column2, ... WHERE ...
would make SELECT consistent with INSERT INTO table SET... and UPDATE table SET..., and besides, it would make command completion in the SQL shell perfectly possible: when entering column names after GET, the shell knows already in which tables to look for suggestions.

That's it for now. I think I'll gradually elaborate the idea of an imperative language with persistent data, because it sounds fantastic and pretty feasible.