π(n): A Call For Primecounters

Please note: These results are now final. The contest produced a number of interesting algorithms, and some surprising results.

What started all this?

Well, James S. Harris in sci.math posted a routine that purported to be a fast method of counting the number of primes from 1 to N, inclusive; mathematically this is denoted by π(N), at least according to http://mathworld.wolfram.com/PrimeCountingFunction.html. Several others (including myself) immediately chimed in, some with competing algorithms, and James Harris made another attempt at a fast prime counter, and I started collecting the things and making some of my own and, well, here we are; I now have some results.

Submission criteria.

Submissions were of course very informal (I'm not all that organized and not affiliated with any "formal math" org), but, since I'm a professional programmer who's been around the block at least once I feel I can (and did have to) use a shoehorn on a few entries; hopefully it hasn't hurt the results much. I'll attempt to give credit where credit is due, and as long as it's fairly obvious what to call where, I'll let minor issues slide if I can fix them. I also have some of my own, some good, some not so good.

The general idea was to implement a C or C++ function

unsigned long long pi(unsigned long long x);

which implements the prime counting function (I picked this form mostly because Carl W.'s post suggested it). My original post is here, for those who want the specifics.

One thing I boneheadedly left out of the original post was the actual deadline, which was rather silly of me. The contest started on 2003-09-05 and ended on 2003-09-15, as judged by my newsspool's clock (PDT). I use leafnode, and I'll admit this is a little parochial, but fortunately my copy of leafnode is set up to pull sci.math posts every 4 hours or so, connection permitting. I'll probably rank things Tuesday night anyway.

I'll accept E-mail submissions, if I see them; I get a *lot* of junk mail. You're probably better off posting. :-)

Build criteria.

The computer used is an old Optiplex GX1 with a 400 Mhz Pentium II and 320 megs memory (since even my sieve didn't use that much memory starvation isn't an issue), running a variant of Debian/x86 Linux (kernel version 2.4.20-1-686). The optimization flags I used are g++ -O2 -fPIC. The -fPIC is a bit of a mistake but I don't really want to change it now; it's an accidental inclusion from my quick-and-dirty setup of the initial project into my source code tree (which is primarily optimized for building shared libraries). Once built one can simply run the result, with a single argument.

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 5
model name	: Pentium II (Deschutes)
stepping	: 2
cpu MHz		: 399.009
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 796.26

Judgement criteria.

I'm not really sure I have any specific hard-and-fast rules, which is a bit of a problem. However, I like the following:
  1. It should be small, light, and fast.
  2. It should be elegant from a mathematical perspective, if possible.
  3. It should be readable.
  4. It should compute the correct answer.

If you've been following my posts an informal test is to simply call pi(1000000) and ensure that it comes up with the right answer, which is the value 78498. This also allows for an estimate of the time complexity -- a single data point, anyway. I'm supplying the maindriver.C and driver.h; the former calls the submission, and the latter simply tells the former what the signature is.

Awards.

Awards are very simple; I don't have any! However, I will rank the results by time of execution, and, if I can analyze them, by depth of recursion stack and amount of memory consumed as well, and put the results here.

Copyrights

While I have mercilessly modified the code to fit into my straitjacket, I nevertheless otherwise recognize copyright of their rightful owners. Because of various stupid issues regarding E-mail and the 'Net, I won't publish non-munged addresses; however, it should not be difficult to see who did what. I've included the original posts for all submissions that aren't mine, that I could find on my newsspool.

The Test.

A little bit of thought went into the test -- but only a little. A bit of experimentation provided a structure which makes it very easy to add additional submissions.

Makefile

The Makefile is the key to the test. Makefiles are very useful for program building, but they can be used for almost any task that has dependencies of one file on another. Ergo, I simply named the executables whatever.elf and I'm off and running. The results are stored in whatever.test and are space-separated tokens, one run per line. A little Perl glue in test.perl and it's off to the races.
Format of each line of the .test file.
PosNameDescription
1NArgument to π(n)
2π(N)The result.
3sbrkDifference between sbrk(0) at the start of the test and sbrk(0) at the very end. This measurement is almost useless.
4realAs measured by time(1), the real time, in seconds. This measurement isn't all that useful.
5userAs measured by time(1), the user time, in seconds. This is arguably the most useful of the three.
6sysAs measured by time(1), the system time, in seconds. Since the programs are not using system routines this is almost always 0.

Plotting

Obviously this sort of test generates a lot of results, so to make sense of them I threw in gnuplot, a free and highly capable plotting package. Gnuplot supports loading of scripts so the Perl script genplots.perl is used for generation thereof; makes things a lot easier for me. A little experimentation with the preliminary submissions to date determined that the best plot is a semilog plot, although gnuplot doesn't support this directly so I have to fudge it with the log() function. However, there are so many entries the plots are almost useless now.

Speaking of which...

The Entries!

Here is where I list the entries, in chronological order where I remember it.

James Harris' first attempt.

James Harris originally thought up this interesting, if slightly incomprehensible to me, variant; the routine calls itself and does get the right answer. It's reasonably elegant but slow. The Post. jamesharrisprime.C

James Harris' first attempt, substituted

The post also suggests changing doubles to ints. I have done so in this variant. It didn't help much. jamesharrisprimeint.C

Erastothene's Sieve

A classic. This is my own code, in response. I would have thought this was the fastest but alas, it is not! It might gain some points for elegance. But then, I wrote it, so am I the best judge in this case? sieve.C

Sieve, boolean variant

You'll probably have noticed that the original sieve was coded using a std::vec<char>. Using a bool instead slows things down but reduces the memory usage by a factor of about 8. sievebool.C

Sieve6, throwing out 2's and 3's early

This one is now working. It attempts to filter out multiples of 2 and 3 very early on, only allowing for the numbers (6k-1) and (6k+1) for k > 0. While testing it I ran into some interesting bugs caused by too-short variables; specifically, it was at one point too low by one while calculating pi(532415). The correct value is 44005. It turns out this is the best of my sieves, most likely because it doesn't test as many bits. sieve6bool.C

Seqtest

This is a very silly but straightforward implementation of mine which simply tests succeeding numbers for primeness, although it does test only values of the form (6k-1) and (6k+1), k > 0. The test for each N is conducted using the part of the prime array constructed so far, up to and including floor(sqrt(N)). Despite the silliness it's actually quite fast, if one can live with the prime array under construction during the count. seqtest.C

Legendre's φ (phi) function

This entry (my own) takes into account the relation π(x) = φ(x,π(a)) - 1 + a, where a = π(floor(sqrt(x))). This makes it recursive, although the stack depth is fairly short. Turns out the run time is, too, although my original coding didn't quite work; a few helpful posts later, however, led to this working submission. The Post. legendrephi.C

Carl W.'s Algorithm.

This entry has the virtue of extreme simplicity, taking no space for scratch arrays as sieve does, and no recursion. It was submitted by Carl W. The Post. carlw.C

Christian Bau's Algorithm.

This one's even simpler, if a bit on the brain-dead side, as it uses all integers from 1 to sqrt(n) to check if a given integer n is prime, not merely the primes from 1 to sqrt(n). But when one has no scratch space and only a few registers, there's only so much one can do. The original post contained a few errors, which were corrected by another poster whose name I've forgotten. The Post. christianbau.C

C. Bond's first submission.

A modification of Christian Bau's method. The Post. cbond.C

C. Bond's second submission.

Another modification of Christian Bau's method, with a few more special cases. The Post. cbond2.C

James Harris' third attempt.

After all of the lambasting regarding JSH's algorithm, JSH created this variant, which basically took his snail and attached an Acme Rocket. The result is surprisingly speedy. The Post. jamesharrisprime3.C

Daniel A. Jimenez

An impressive if somewhat complicated submission from Mr., Jimenez, recorded 2003-09-05. It even beat sieve, which I was not expecting at all. Its main drawback appears to be the use of 65,536 element char arrays on the stack. The Post. daniel.C

binarysearch

Another submission by yours truly. This one's extremely ugly (it's a binary lookup into a table of primes from 1 to 20 million) but also extremely fast -- though I'm wondering if load time will count in my benchmarking; the prime table is 5 megabytes in size! The actual table was computed using sieve and generated in assembly -- it turns out GNU C++ segfaults on such large data structures. It's a nice illustration of the power and simplicity of STL and it has a certain beauty engineering wise, but from a mathematical elegance standpoint, it stinks. binarysearch.C (sans table) Relevant primetable.S excerpts.

Obfuscated Prime Generator

This is apparently one of Minor Crank's contributions to the debate. I've rejected it mostly because (a) I didn't see it before scanning my newsspool, (b) it's so abstruse as to defy meaningful analysis, and (c) it doesn't fit my requirements. I include it here mostly for completeness. Sorry, Minor. It is an interesting implementation, though. The Post.

David Bernier

This is a modification of either Christian Bau's or C. Bond's text, posted by David Bernier and detected after the fact. Sorry David. Fortunately a newsspool scan found your submission, even if you didn't know it. :-) The Post. davidbernier.C

Jason Rodgers

This one came in via private Email. It contains a number of minor technical flaws, mostly in C++ memory handling, but makes the cut. It appears to be a reimplementation of a simple sieve, optimized for high speed, and sandwiched itself between sieve and sievebool in my testing. jrodgers.C

Stan Gula

This is a modification of James Harris' algorithm which removes the need for sqrt() entirely -- even in the initial call, which was an error I made while preparing this submission. The Post. stangula.C

Stan Gula #2

Stan then made another submission, improving his algorithm. The Post. stangula2.C

Daniel Jiminez #2

Not to be outdone, Daniel submitted this entry. The Post. daniel2.C

Daniel Jiminez #3

... and then modified Stan's algorithm to include a cache. The Post. daniel3.C

JRodgers #2

A nearly trivial modification should double the speed. jrodgers2.C

LegendrePhi #2

An anonymous submitter (thank you) suggested some changes to my Legendrephi routines, basically adding a cache and changing doubles to unsigned or int. The changes are very straightforward. I've taken the liberty of folding the initialization into my code instead of using a separate "piQ", and changing "magic numbers" to named constants. legendrephi2.C

JRodgers #3

Another nearly trivial modification should increase the speed even more. jrodgers3.C

Stan Gula #3

This one came by private Email. I'll admit it's an interesting solution but that ungainly array is ridiculously huge. stangula3.C

Edgar Binder

Another private Email submission. This one had to be very slightly modified. The Email purports O(sqrt(n) * log(n)) memory but I can't say I've verified it. The original had a 4095-size cache, which wasn't big enough; doubling its size, however, allowed computation of pi(10M) without crashing. edgar.C

Daniel Jiminez #4

A few tweaks and another entry. The Post. daniel4.C

Index lookup.

Another entry in the "very very ugly" category by yours truly. The table is easily built (I used my sieve for the actual bits) but is absolutely huge, over 80 MB in size. Therefore this method, while ultra-fast, is of only theoretical interest; it is worth noting that the assembly (using as) of pitab.o took over 15 minutes, of which only a small fraction was the actual sieving. No, I'm not about to excerpt the .S table; it's way too big. :-) index.C pitab.h

Itty Bitty Bitcounter

This one's simple and stupid, using a prefabricated table consisting of bits. Each bit is 1 if the number is a prime. The size of the table is about 8 MB, and bitty just counts the bits. Of course the bit table is generated using my siever. bitty.C bittable.h bittable.S excerpt

Herb Savage

Out of the blue comes this entry, which did very well, and is prettily done. It appears to be a variant of Legendre's algorithm, with memoization. The Post savage.C

Jason Rodgers #4

Another tweak, presumably. The Post jrodgers4.C

Herb Savage #2

More memoization vaulted this entry over stangula3. The Post savage2.C

Edgar #3

Another submission by Edgar. I seem to have missed #2; the private E-mail I did receive identified the routine using edgar3.cpp so I'm calling this #3. edgar3.C

The Results.

So far, all of the submissions (including a repaired sieve6bool, which was an off-by-1 error, and a repaired edgar, which needed a larger cache) computed the correct answers.

And now the moment you've been waiting for....(drumroll)...

Tabular form, last datapoint.

These are from the 10M datapoint -- the last line of each of the .test files. Some of the algorithms appear to have better performance at large numbers.
Results, raw
entryπ(10M)sbrkrealusersys
jamesharrisprime6645790189.345189.290
jamesharrisprimeint6645790181.224181.210
jamesharrisprime3664579012.02312.020
sieve66457902.2722.210.06
sievebool66457904.2424.240
sieve6bool66457901.1481.140.01
seqtest6645791074825.40525.380.03
legendrephi664579142165.5935.590
carlw6645790136.822136.810
cbond664579088.64388.640
christianbau6645790399.74396.230.14
cbond2664579049.51349.060.02
daniel66457900.6890.680
binarysearch66457900.0130.010.01
davidbernier664579082.56582.220.01
jrodgers66457907.0183.620
stangula66457909.1035.10
stangula266457903.2410.890
daniel266457901.2290.350.01
daniel366457900.350.080.01
jrodgers2664579012.062.980
legendrephi2664579141720.240.150
jrodgers366457904.7532.60.04
stangula366457900.0390.030.01
edgar66457903.1051.60.01
daniel466457900.080.040.01
bitty66457900.4410.320
index66457900.0660.010.01
savage66457900.0760.050.01
jrodgers466457902.4622.20.02
savage266457900.0350.030
edgar366457900.0480.030.02

Results, ordered by user time
entryπ(10M)sbrkrealusersys
binarysearch66457900.0130.010.01
index66457900.0660.010.01
edgar366457900.0480.030.02
savage266457900.0350.030
stangula366457900.0390.030.01
daniel466457900.080.040.01
savage66457900.0760.050.01
daniel366457900.350.080.01
legendrephi2664579141720.240.150
bitty66457900.4410.320
daniel266457901.2290.350.01
daniel66457900.6890.680
stangula266457903.2410.890
sieve6bool66457901.1481.140.01
edgar66457903.1051.60.01
jrodgers466457902.4622.20.02
sieve66457902.2722.210.06
jrodgers366457904.7532.60.04
jrodgers2664579012.062.980
jrodgers66457907.0183.620
sievebool66457904.2424.240
stangula66457909.1035.10
legendrephi664579142165.5935.590
jamesharrisprime3664579012.02312.020
seqtest6645791074825.40525.380.03
cbond2664579049.51349.060.02
davidbernier664579082.56582.220.01
cbond664579088.64388.640
carlw6645790136.822136.810
jamesharrisprimeint6645790181.224181.210
jamesharrisprime6645790189.345189.290
christianbau6645790399.74396.230.14

Performance plot.

Performance plot, log Y.

Algorithm/space analysis

The routines can be divided up into the following categories. (And I think I actually have them correctly done, now. Shows how much sleep I've been getting lately...no, it's not the contest. But that's a post for another day.)

Nonrecursive small space

These are the ones which take no space except for scratch registers and/or a small number of variables on the stack. I for one would consider these the most elegant. These are ranked in speed order.

Nonrecursive large space, stack

These are nonrecursive but requires a large amount of stack space.

Nonrecursive large space, Read-Only Memory (ROM)

These do not call themselves but use large amounts of carefully prepared read-only memory space.

Nonrecursive large space, dynamic

These do not call themselves but use large amounts of dynamically-allocated memory space.

Recursive small space

These call a part of themselves to compute the result, and otherwise use nothing but scratch registers and/or a small number of variables on the stack. These are also ranked in speed order.

Recursive large space, ROM

These call themselves and use a large amount of carefully prepared read-only memory space.

Recursive large space, static

These call themselves and use a large static area.

Recursive large space, dynamic

These call themselves and use large amounts of dynamically-allocated memory space.

Recursive large space, static and dynamic

These call themselves and use large amounts of both static and dynamically-allocated memory space.

A note on memoization.

Briefly, memoization refers to the caching of intermediate results. I'd have a more specific definition here but neither Merriam-Websters or dictionary.reference.com seems to even know the word exists.

As the contest progressed, it became clear that a fair number of entries took advantage of this speedup method -- which is essentially a tradeoff of increased space for increased time -- and rocketed past the competition. I had anticipated the notion of categorizing the algorithms but had not thought of this specific method. Unfortunately, implementation of this method usually detracts from elegance, and increases the number of lines of code.

Conclusion

Raw, unadulterated speed.

I thought I may have to declare joint winners, as Mr. Jimenez and Stan Gula seem to be collaborating. :-) As it is, it turns out I still do because of resolution problems with the user time timer; Stan Gula's third submission (stangula3) was the fastest but savage2, which came out of nowhere I can identify, and edgar3 blasted into a three-way tie for first place, with daniel4, Daniel's fourth attempt, placing fourth. (I can't count binarysearch or index, although both are faster, because they're just too damned ugly. bitty wasn't all that fast.) A "runoff" at 20M and 30M shows that edgar3 is the fastest of the three, though. Apologies to Christian Bau and his christianbau submission; it brings up the rear, despite (or perhaps because) of its extreme simplicity. :-)

Elegance and simplicity.

These are of course my own judgement. Feel free to disagree! :-)

Nonrecursive, no memory.

christianbau is arguably the simplest but unfortunately was the slowest of all entries. cbond2 was the fastest entry here but still took 49 seconds; none of these were speed demons.

Nonrecursive, memory.

This is a combination of all other nonrecursive entries. binarysearch and index were the fastest nonrecursive algorithms -- in fact, the fastest algorithms, period. However, they took huge amounts of ROM space; index in particular was over 80 MB in size, with binarysearch consuming 5 megabytes. bitty took 2 1/2 MB and did reasonably well, and my sieves are very simple, but these are all mine; is that fair?

Of the ones that are left, jrodgers4 gets the nod, for his implementation of a bit class.

Recursive, no memory.

stangula gets the nod here performancewise, but jamesharrisprime3 is more elegant. Surprise, James. You've won a consolation prize. :-)

Recursive, memory.

Most of the entries fall here, admittedly. Of these, stangula2 and stangula3 were the most interesting, illustrating how memoization can speed up things to ridiculous extremes. (I should note that the logical culmination of these, however, is index, my own entry.) legendrephi2, my own entry, did well in the testing but edgar3, savage2, daniel4 and savage beat it. Of these, edgar3 gets the nod -- but all were good implementations.

Fin

The contest is now closed, but I hope you had as much fun writing these as I had evaluating them. Thanks to all who participated. :-)

Up