π(n): A Call For Primecounters
Please note: These results are now final. The contest
produced a number of interesting algorithms, and some
surprising results.
What started all this?
Well, James S. Harris in sci.math posted a routine
that purported to be a fast method of counting the
number of primes from 1 to N, inclusive; mathematically
this is denoted by π(N), at least according to
http://mathworld.wolfram.com/PrimeCountingFunction.html.
Several others (including myself) immediately chimed in,
some with competing algorithms, and James Harris made
another attempt at a fast prime counter, and I started
collecting the things and making some of my own and, well,
here we are; I now have some results.
Submission criteria.
Submissions were of course very informal (I'm not all that
organized and not affiliated with any "formal math" org),
but, since I'm a professional programmer who's been around
the block at least once I feel I can (and did have to) use
a shoehorn on a few entries; hopefully it hasn't hurt the
results much. I'll attempt to give credit where credit
is due, and as long as it's fairly obvious what to call
where, I'll let minor issues slide if I can fix them.
I also have some of my own, some good, some not so good.
The general idea was to implement a C or C++ function
unsigned long long pi(unsigned long long x);
which implements the prime counting function (I picked this form
mostly because Carl W.'s post suggested it). My original post
is here, for those who want the specifics.
One thing I boneheadedly left out of the original
post was the actual deadline, which was rather silly
of me. The contest started on 2003-09-05 and ended on
2003-09-15, as judged by my newsspool's clock (PDT).
I use leafnode, and I'll admit this is a little
parochial, but fortunately my copy of leafnode is set
up to pull sci.math posts every 4 hours or so,
connection permitting. I'll probably rank things Tuesday
night anyway.
I'll accept E-mail submissions, if I see them; I get a
*lot* of junk mail. You're probably better off posting. :-)
Build criteria.
The computer used is an old Optiplex GX1 with a 400 Mhz
Pentium II and 320 megs memory (since even my sieve didn't
use that much memory starvation isn't an issue), running a
variant of Debian/x86 Linux (kernel version 2.4.20-1-686).
The optimization flags I used are g++ -O2 -fPIC.
The -fPIC is a bit of a mistake but I don't really
want to change it now; it's an accidental inclusion from my
quick-and-dirty setup of the initial project into my source
code tree (which is primarily optimized for building shared
libraries). Once built one can simply run the result, with
a single argument.
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 399.009
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 796.26
Judgement criteria.
I'm not really sure I have any specific hard-and-fast rules, which is
a bit of a problem. However, I like the following:
- It should be small, light, and fast.
- It should be elegant from a mathematical perspective, if possible.
- It should be readable.
- It should compute the correct answer.
If you've been following my posts an informal test is to simply call
pi(1000000) and ensure that it comes up with the right answer,
which is the value 78498. This also allows for an estimate
of the time complexity -- a single data point, anyway. I'm supplying
the maindriver.C
and driver.h; the former calls the submission,
and the latter simply tells the former what the signature is.
Awards.
Awards are very simple; I don't have any! However, I will rank the
results by time of execution, and, if I can analyze them, by
depth of recursion stack and amount of memory consumed as well,
and put the results here.
Copyrights
While I have mercilessly modified the code to fit into my straitjacket,
I nevertheless otherwise recognize copyright of their rightful owners.
Because of various stupid issues regarding E-mail and the 'Net, I
won't publish non-munged addresses; however, it should not be difficult
to see who did what. I've included the original posts for all submissions
that aren't mine, that I could find on my newsspool.
The Test.
A little bit of thought went into the test -- but only a little. A
bit of experimentation provided a structure which makes it very easy
to add additional submissions.
Makefile
The Makefile is the key to the
test. Makefiles are very useful for program building,
but they can be used for almost any task that has
dependencies of one file on another. Ergo, I simply
named the executables whatever.elf
and I'm off and running. The results are stored in
whatever.test and are space-separated
tokens, one run per line. A little Perl glue in test.perl and it's off to the races.
Format of each line of the .test file.
| Pos | Name | Description |
| 1 | N | Argument to π(n) |
| 2 | π(N) | The result. |
| 3 | sbrk | Difference between sbrk(0) at the start of the test and sbrk(0) at the very end. This measurement is almost useless. |
| 4 | real | As measured by time(1), the real time, in seconds. This measurement isn't all that useful. |
| 5 | user | As measured by time(1), the user time, in seconds. This is arguably the most useful of the three. |
| 6 | sys | As measured by time(1), the system time, in seconds. Since the programs are not using system routines this is almost always 0. |
Plotting
Obviously this sort of test generates a lot of results, so to make sense of
them I threw in gnuplot, a free and
highly capable plotting package. Gnuplot supports loading of scripts
so the Perl script genplots.perl is used
for generation thereof; makes things a lot easier for me. A little
experimentation with the preliminary submissions to date determined
that the best plot is a semilog plot, although gnuplot doesn't support
this directly so I have to fudge it with the log() function.
However, there are so many entries the plots are almost useless now.
Speaking of which...
The Entries!
Here is where I list the entries, in chronological order
where I remember it.
James Harris' first attempt.
James Harris originally thought up this interesting, if slightly
incomprehensible to me, variant; the routine calls itself and
does get the right answer. It's reasonably elegant but slow.
The Post.
jamesharrisprime.C
James Harris' first attempt, substituted
The post also suggests changing doubles to ints.
I have done so in this variant. It didn't help much.
jamesharrisprimeint.C
Erastothene's Sieve
A classic. This is my own code, in response. I would have thought
this was the fastest but alas, it is not! It might gain
some points for elegance. But then, I wrote it, so am I the
best judge in this case?
sieve.C
Sieve, boolean variant
You'll probably have noticed that the original sieve was coded using
a std::vec<char>. Using a bool instead
slows things down but reduces the memory usage by a factor of about 8.
sievebool.C
Sieve6, throwing out 2's and 3's early
This one is now working. It attempts to filter out
multiples of 2 and 3 very early on, only allowing for the
numbers (6k-1) and (6k+1) for k > 0. While testing
it I ran into some interesting bugs caused by too-short
variables; specifically, it was at one point too low by one
while calculating pi(532415). The correct value is 44005.
It turns out this is the best of my sieves, most likely
because it doesn't test as many bits.
sieve6bool.C
Seqtest
This is a very silly but straightforward implementation of
mine which simply tests succeeding numbers for primeness,
although it does test only values of the form (6k-1) and
(6k+1), k > 0. The test for each N is conducted using
the part of the prime array constructed so far, up to and
including floor(sqrt(N)). Despite the silliness it's actually
quite fast, if one can live with the prime array under
construction during the count.
seqtest.C
Legendre's φ (phi) function
This entry (my own) takes into account the relation π(x) = φ(x,π(a)) - 1 + a,
where a = π(floor(sqrt(x))). This makes it recursive, although the
stack depth is fairly short. Turns out the run time is, too, although
my original coding didn't quite work; a few helpful posts later, however,
led to this working submission.
The Post.
legendrephi.C
Carl W.'s Algorithm.
This entry has the virtue of extreme simplicity, taking no space for
scratch arrays as sieve does, and no recursion. It was submitted
by Carl W.
The Post.
carlw.C
Christian Bau's Algorithm.
This one's even simpler, if a bit on the brain-dead side, as it
uses all integers from 1 to sqrt(n) to check if a given integer n
is prime, not merely the primes from 1 to sqrt(n). But when
one has no scratch space and only a few registers, there's only
so much one can do. The original post contained a few errors,
which were corrected by another poster whose name I've forgotten.
The Post.
christianbau.C
C. Bond's first submission.
A modification of Christian Bau's method.
The Post.
cbond.C
C. Bond's second submission.
Another modification of Christian Bau's method, with a few more
special cases.
The Post.
cbond2.C
James Harris' third attempt.
After all of the lambasting regarding JSH's algorithm, JSH created
this variant, which basically took his snail and attached an Acme
Rocket. The result is surprisingly speedy.
The Post.
jamesharrisprime3.C
Daniel A. Jimenez
An impressive if somewhat complicated submission from Mr., Jimenez,
recorded 2003-09-05. It even beat sieve, which I was not expecting
at all. Its main drawback appears to be the use of 65,536 element
char arrays on the stack.
The Post.
daniel.C
binarysearch
Another submission by yours truly. This one's extremely
ugly (it's a binary lookup into a table of primes from 1 to
20 million) but also extremely fast -- though I'm wondering
if load time will count in my benchmarking; the prime table
is 5 megabytes in size! The actual table was computed
using sieve and generated in assembly -- it turns
out GNU C++ segfaults on such large data structures.
It's a nice illustration of the power and simplicity of STL
and it has a certain beauty engineering wise,
but from a mathematical elegance standpoint, it stinks.
binarysearch.C (sans table)
Relevant primetable.S excerpts.
Obfuscated Prime Generator
This is apparently one of Minor Crank's contributions to the debate. I've
rejected it mostly because (a) I didn't see it before scanning my newsspool,
(b) it's so abstruse as to defy meaningful analysis, and (c) it
doesn't fit my requirements. I include it here mostly for completeness.
Sorry, Minor. It is an interesting implementation, though.
The Post.
David Bernier
This is a modification of either Christian Bau's or C. Bond's text,
posted by David Bernier and detected after the fact. Sorry David.
Fortunately a newsspool scan found your submission, even if you
didn't know it. :-)
The Post.
davidbernier.C
Jason Rodgers
This one came in via private Email. It contains a number of minor
technical flaws, mostly in C++ memory handling, but makes the cut.
It appears to be a reimplementation of a simple sieve, optimized
for high speed, and sandwiched itself between sieve
and sievebool in my testing.
jrodgers.C
Stan Gula
This is a modification of James Harris' algorithm which removes the need
for sqrt() entirely -- even in the initial call, which was an error
I made while preparing this submission.
The Post.
stangula.C
Stan Gula #2
Stan then made another submission, improving his algorithm.
The Post.
stangula2.C
Daniel Jiminez #2
Not to be outdone, Daniel submitted this entry.
The Post.
daniel2.C
Daniel Jiminez #3
... and then modified Stan's algorithm to include a cache.
The Post.
daniel3.C
JRodgers #2
A nearly trivial modification should double the speed.
jrodgers2.C
LegendrePhi #2
An anonymous submitter (thank you) suggested some changes to my Legendrephi
routines, basically adding a cache and changing doubles to unsigned or int.
The changes are very straightforward. I've taken the liberty of
folding the initialization into my code instead of using a separate
"piQ", and changing "magic numbers" to named constants.
legendrephi2.C
JRodgers #3
Another nearly trivial modification should increase the speed even more.
jrodgers3.C
Stan Gula #3
This one came by private Email. I'll admit it's an interesting solution
but that ungainly array is ridiculously huge.
stangula3.C
Edgar Binder
Another private Email submission. This one had to be very
slightly modified. The Email purports O(sqrt(n) * log(n))
memory but I can't say I've verified it. The original had
a 4095-size cache, which wasn't big enough; doubling its size,
however, allowed computation of pi(10M) without crashing.
edgar.C
Daniel Jiminez #4
A few tweaks and another entry.
The Post.
daniel4.C
Index lookup.
Another entry in the "very very ugly" category by yours truly.
The table is easily built (I used my sieve for the actual bits)
but is absolutely huge, over 80 MB in size. Therefore
this method, while ultra-fast, is of only theoretical interest;
it is worth noting that the assembly (using as) of
pitab.o took over 15 minutes, of which only a small
fraction was the actual sieving. No, I'm not about to excerpt
the .S table; it's way too big. :-)
index.C
pitab.h
Itty Bitty Bitcounter
This one's simple and stupid, using a prefabricated table consisting
of bits. Each bit is 1 if the number is a prime. The size of the
table is about 8 MB, and bitty just counts the bits. Of
course the bit table is generated using my siever.
bitty.C
bittable.h
bittable.S excerpt
Herb Savage
Out of the blue comes this entry, which did very well,
and is prettily done. It appears to be a variant of
Legendre's algorithm, with memoization.
The Post
savage.C
Jason Rodgers #4
Another tweak, presumably.
The Post
jrodgers4.C
Herb Savage #2
More memoization vaulted this entry over stangula3.
The Post
savage2.C
Edgar #3
Another submission by Edgar. I seem to have missed #2; the
private E-mail I did receive identified the routine using
edgar3.cpp so I'm calling this #3.
edgar3.C
The Results.
So far, all of the submissions (including a repaired sieve6bool, which
was an off-by-1 error, and a repaired edgar, which needed a larger
cache) computed the correct answers.
And now the moment you've been waiting for....(drumroll)...
Tabular form, last datapoint.
These are from the 10M datapoint -- the last line of each of the .test files.
Some of the algorithms appear to have better performance at large numbers.
Performance plot.
Performance plot, log Y.
Algorithm/space analysis
The routines can be divided up into the following categories.
(And I think I actually have them correctly done, now. Shows
how much sleep I've been getting lately...no, it's not the
contest. But that's a post for another day.)
Nonrecursive small space
These are the ones which take no space except for scratch registers
and/or a small number of variables on the stack. I for one would
consider these the most elegant. These are ranked in speed order.
Nonrecursive large space, stack
These are nonrecursive but requires a large amount of stack space.
Nonrecursive large space, Read-Only Memory (ROM)
These do not call themselves but use large amounts of carefully
prepared read-only memory space.
Nonrecursive large space, dynamic
These do not call themselves but use large amounts of dynamically-allocated
memory space.
Recursive small space
These call a part of themselves to compute the result,
and otherwise use nothing but scratch registers and/or a
small number of variables on the stack. These are also
ranked in speed order.
Recursive large space, ROM
These call themselves and use a large amount of carefully
prepared read-only memory space.
Recursive large space, static
These call themselves and use a large static area.
Recursive large space, dynamic
These call themselves and use large amounts of dynamically-allocated memory space.
Recursive large space, static and dynamic
These call themselves and use large amounts of both static
and dynamically-allocated memory space.
A note on memoization.
Briefly, memoization refers to the caching of intermediate results.
I'd have a more specific definition here but neither Merriam-Websters
or dictionary.reference.com seems to even know the word exists.
As the contest progressed, it became clear that a fair number
of entries took advantage of this speedup method -- which is
essentially a tradeoff of increased space for increased time -- and
rocketed past the competition. I had anticipated the notion of
categorizing the algorithms but had not thought of this specific
method. Unfortunately, implementation of this method usually
detracts from elegance, and increases the number of lines of code.
Conclusion
Raw, unadulterated speed.
I thought I may have to declare joint winners, as
Mr. Jimenez and Stan Gula seem to be collaborating. :-)
As it is, it turns out I still do because of resolution
problems with the user time timer; Stan Gula's third
submission (stangula3) was the fastest
but savage2, which came out of nowhere I
can identify, and edgar3 blasted into a
three-way tie for first place, with daniel4,
Daniel's fourth attempt, placing fourth.
(I can't count binarysearch or index,
although both are faster, because they're just too damned
ugly. bitty wasn't all that fast.) A "runoff"
at 20M and 30M shows that edgar3 is the fastest of the three,
though.
Apologies to Christian Bau and his christianbau
submission; it brings up the rear, despite (or perhaps
because) of its extreme simplicity. :-)
Elegance and simplicity.
These are of course my own judgement. Feel free to
disagree! :-)
Nonrecursive, no memory.
christianbau is arguably the simplest
but unfortunately was the slowest of all entries.
cbond2 was the fastest entry here but still
took 49 seconds; none of these were speed demons.
Nonrecursive, memory.
This is a combination of all other nonrecursive entries.
binarysearch and index were the fastest
nonrecursive algorithms -- in fact, the fastest algorithms,
period. However, they took huge amounts of ROM space;
index in particular was over 80 MB in size, with
binarysearch consuming 5 megabytes. bitty
took 2 1/2 MB and did reasonably well, and my sieves are
very simple, but these are all mine; is that fair?
Of the ones that are left, jrodgers4
gets the nod, for his implementation of a bit class.
Recursive, no memory.
stangula gets the nod here performancewise,
but jamesharrisprime3 is more elegant.
Surprise, James. You've won a consolation prize. :-)
Recursive, memory.
Most of the entries fall here, admittedly. Of these,
stangula2 and stangula3 were the most
interesting, illustrating how memoization can speed up
things to ridiculous extremes. (I should note that
the logical culmination of these, however, is index,
my own entry.) legendrephi2, my own entry, did
well in the testing but edgar3, savage2, daniel4
and savage beat it. Of these, edgar3
gets the nod -- but all were good implementations.
Fin
The contest is now closed, but I hope you had as much
fun writing these as I had evaluating them. Thanks to
all who participated. :-)
Up