Path: newsspool2.news.pas.earthlink.net!stamper.news.pas.earthlink.net!elnk-nf2-pas!newsfeed.earthlink.net!newshub.sdsu.edu!logbridge.uoregon.edu!newsfeed.cs.utexas.edu!news.cs.utexas.edu!not-for-mail Message-ID: From: djimenez@cs.utexas.edu (Daniel A. Jimenez) Newsgroups: sci.math Subject: Re: Second Call For Primecounter Contest Entries Date: 9 Sep 2003 07:40:40 -0500 References: <0j27b.3569$ej1.2908@nwrdny01.gnilink.net> Lines: 153 Organization: CS Dept, University of Texas at Austin Sender: djimenez@zorkmid.cs.utexas.edu NNTP-Posting-Host: zorkmid.cs.utexas.edu X-Trace: news.cs.utexas.edu 1063111242 7663 128.83.120.248 (9 Sep 2003 12:40:42 GMT) X-Complaints-To: usenet@cs.utexas.edu NNTP-Posting-Date: Tue, 9 Sep 2003 12:40:42 +0000 (UTC) X-Received-Date: Tue, 09 Sep 2003 05:40:48 PDT (newsspool2.news.pas.earthlink.net) Xref: lexi2.athghost7038suus.net sci.math:36716 In article , Stan Gula wrote: >"Daniel A. Jimenez" wrote in message >news:bjike9$1mb$1@zorkmid.cs.utexas.edu... >> >One more attempt and I have to retire. I am now showing times as good as >> >Daniel A. Jimenez for tests up to 10^8 and I'm beating him at 10^9. >> >> D'oh! Ok, here's a tweaked version of my last submission. It's about >> three times faster for large values. >> >- >> Daniel A. Jiménez djimenez@cs.utexas.edu >> Assistant Professor djimenez@cs.rutgers.edu >> Department of Computer Science >> Rutgers University > >Oh, if you want to play that game, I can bump my cache table. Making it >8192 values makes it three times faster. That's the magic of recursion - >there's a really easy way to cut branches off by increasing the cached >values. > >By the way, I like your choice of descriptive variable names. > >Your new version is much faster, however, I have the feeling I can push the >cached values as far as necessary... > >I tested my JSH/SG8192 versus your latest at 10^8 and 10^9 and the results >were: Wow. Ok, mine is crap. But we can improve yours a little. I made a few modifications to your program: 1. The cache is of size 65536, and memoized (i.e. computed dynamically) instead of precomputed. This modification gives the biggest boost, as we will see below. It also decreases the size of the source code by about a factor of 10 :-) 2. The 'JSHpi' function is modified to take only one parameter since the calling context makes it clear that we only need the second parameter once, and that instance can be replaced by a call to S2. This way, we get rid of an extra item on the stack, as well as a branch in JSHpi. 3. The (i-dx)*(i-dx) computation is replaced by a simpler computation that keeps a running total of (i-1)*2 each time through the loop; that's always equal to (i-dx)*(i-dx) and requires only a subtract by one and shift by one, which should be faster than a full-on multiply. 4. Miscellaneous hacks that probably don't help but make me feel better. Here are the times I get on an AMD Athlon XP 3200+: % /usr/bin/time -p ./yours-with-a-8192-precomputed-cache 1000000000 Prime count (JSH/SG): 50847534 real 0.52 user 0.52 sys 0.00 % /usr/bin/time -p ./with-my-modifications-except-for-the-first-one 1000000000 Prime count (JSH/SG): 50847534 real 0.48 user 0.49 sys 0.00 % /usr/bin/time -p ./with-all-my-modifications 1000000000 Prime count (JSH/SG): 50847534 real 0.35 user 0.36 sys 0.00 My modifications 2 through 4 also help on a Pentium III, but not a Pentium 4. By the way, I used the idea of coming up with faster and faster prime counting programs as an example of speedup in my graduate computer architecture class yesterday. I told the students they could develop their own prime counting programs as extra credit. I'll let you (all) know if they come up with something better. Here's Stan's code that I have modified: #include #include #include unsigned S2(unsigned x, unsigned yin); #define cache_size 65536 int f_cache[cache_size]; unsigned JSHpi1(unsigned xin) { // cache most common answers if (xin0 on the command line"<