Prime Time Programming, Part 2

2011-05-09 Development Development / Primes / Profiling / ProjectEuler / Python

Last time I presented a truly horrible prime number generator I was using for Project Euler problems. Then I presented a revamped generator that used trial division. By adding various refinements to the generator, we saw the time required to generate primes less than 10⁷ shrink from hours to 123 seconds. Today I'll describe a different approach that's even more effective.

Attempt 2: Sieve of Eratosthenes

The Sieve of Eratosthenes is another method for finding prime numbers. The algorithm is basically this:

make a big array of numbers, from 2 to the highest prime you're hoping to find
look for the next number that's not crossed off
this number is your next prime
cross off every multiple of the number you just found
so long as the prime you just found is less than the square root of your limit, go to step 2
the uncrossed numbers are prime

Suppose we want primes less than or equal to 20. We start with this list:

The first uncrossed off number is 2. That's our first prime. Cross off all the multiples of 2 (believe it or not, the 4 is crossed off):

The next uncrossed off number is 3. Cross off the multiples:

Next, we have 5. Cross off its multiples (actually, they're already crossed off because they're all also multiples of either 2 or 3):

5 is greater than √20, so stop looping. All the uncrossed off numbers are prime:

A "borrowed" implementation

The wikipedia article even provides a Python implementation, which I reproduce here in slightly altered form:

def eratosthenes_sieve(m):
    # Create a candidate list within which non-primes will be
    # marked as None; only candidates below sqrt(m) need be checked. 
    candidates = range(m+1)
    fin = int(m**0.5)

    # Loop over the candidates, marking out each multiple.
    for i in xrange(2, fin+1):
        if not candidates[i]:
            continue

        candidates[2*i::i] = [None] * (m//i - 1)

    # Filter out non-primes and return the list.
    return [i for i in candidates[2:] if i]

I ran this to find my standard list of primes less than 10⁷, and was surprised at the results. The time to completion varied wildly on successive runs. Sometimes over 50 seconds, and sometimes as low as 13. I noticed that when the run times were high, the laptop's hard drive was thrashing, and just afterward my other applications were unresponsive.

I reran the test and, with a little help from PerfMon, found out that the memory usage was off the chart. No, seriously. Right off the top. I had to rescale the graph to get everything to fit. the Private Bytes went way up over 200MB. On my 512 MB laptop with Firefox, emacs, and a few background processes, things slow to a crawl. With a smaller set of primes, or on more powerful iron, this implementation may work, but it's not going to meet my needs.

Attempt 3: The Sieve, but slowly cross out the composites

If allocating a big array of numbers just to cross most of them out right away doesn't work, how about we start by allocating nothing and just cross out numbers at the last moment? The idea is pretty simple: start counting at 2, and keep a record of upcoming composite numbers that we've discovered by looking at multiples of primes so far. Essentially we maintain a rolling wave of the next multiples of 2, 3, 5, …:

let composites = {}, an empty associative array, where each key is a composite number and its value is the prime that it's a multiple of
let n = 2
if n is a known composite, remove it and insert the next multiple in the list
otherwise, it's prime. Yield it and insert a new composite, n²
increment n
go to step 3

Let's see how it works

n	composites
2	`{}`	2 isn't in composites, so yield it. Then insert 2² = 4 and increment n.
3	`{4:2}`	3 isn't in composites, so yield it. Then insert 3² = 9 and increment n.
4	`{4:2, 9:3}`	4 is in composites, with value 2. Remove it, insert 4 + 2 = 6 and increment n.
5	`{6:2, 9:3}`	5 isn't in composites, so yield it. Then insert 5² = 25 and increment n.
6	`{6:2, 9:3, 25:5}`	6 is in composites, with value 2. Remove it, insert 6 + 2 = 8 and increment n.
7	`{8:2, 9:3, 25:5}`	7 isn't in composites, so yield it. Then insert 7² = 49 and increment n.
8	`{8:2, 9:3, 25:5, 49:7}`	8 is in composites, with value 2. Remove it, insert 8 + 2 = 10 and increment n.
9	`{9:3, 10:2, 25:5, 49:7}`	9 is in composites, with value 3. Remove it, insert 9 + 3 = 12 and increment n.
10	`{10:2, 12:3, 25:5, 49:7}`	10 is in composites, with value 2. Remove it, and... wait.

12 is already in the list, because it's a multiple of 3. We can't insert it. We'll have to amend the algorithm to account for collisions. If a multiple is already accounted for, keep moving until we find one that isn't in the list.

In this case, add another 2 to 12 to get 14 and insert it. Then increment n.

n	composites
11	`{12:3, 14:2, 25:5, 49:7}`	11 isn't in composites, so yield it, insert 11² = 121, increment n, and continue…

Show me the code

Here's an implementation of the naive algorithm presented above

def sieve():
    composites = {}
    n = 2
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + factor
            while composites.has_key(q):
                q += factor
            composites[q] = factor
        else:
            # not there - prime
            composites[n*n] = n
            yield n
        n += 1

This implementation takes 26.8 seconds to generate all primes below 10⁷, close to ¼ the time the best trial division algorithm took.

Why is this method so great?

Using the associative array values to remember which "stream" of multiples each composite comes from means that the array is no bigger than the number of primes we've seen so far
The primes can be yielded as soon they're found instead of waiting until the end
Adding p² when we find a new prime cuts down on collisions but still ensures that we'll keep find all multiples of p, because 2p, 3p, …, (p-1)p will be weeded out as multiples of lower primes.
There's very little wasted work - finding a new prime number takes O(1) operations - just checking that the number isn't in the associative array and adding the square to the array. Many composites will take more work, but an amount proportional to the number of distinct prime factors the number has. For example, 12 has prime factors 2, 2, and 3. We tried to add 12 to the array twice, once as a multiple of 2 and once as a multiple of 3. Fortunately, the number of distinct factors is severely limited. For numbers less than 10⁷, 9699690 has the most distinct prime factors: 2, 3, 5, 7, 11, 13, 17, 19. This sure beats the 446 divisions trial division took to find that 9999991 was prime.

The method already incorporates some of the advantages of the souped-up trial division methods from last time. We only worry about multiples of primes, so there's no need to cut out the composite factors. And when checking to see if n is prime, we never consider prime factors larger than √n. Still, there are some optimizations to make.

That's Odd

In the sample runthrough above, the algorithm checks 4, 6, 8, 10, … for primality, even though no even number larger than 2 are prime. Here's an implementation that avoids that:

def odd_sieve():
   composites = {}
   yield 2
   n = 3
   while True:
       factor = composites.pop(n, None)
       if factor:
           q = n + 2 * factor
           while composites.has_key(q):
               q += 2 * factor
           composites[q] = factor
       else:
           # not there - prime
           composites[n*n] = n
           yield n
       n += 2

This method generates primes less than 10⁷ in 13.4 seconds. This is about half the time it took when we didn't pre-filter the evens. In the trial division case, when we cut out the even numbers, we were saving almost no work - one division per even number, out of potentially dozens or hundreds of divisions being performed. This time, we cut out an associative array lookup and insertion, and most numbers are checked by using only a small number of these operations. Let's see what else we can do.

What about 3?

If skipping the numbers that are divisible by 2 paid off, will skipping those divisible by 3 as well? Probably.

def sixish_sieve():
    composites = {}
    yield 2
    yield 3
    step = 2
    n = 5
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + 2 * factor
            while q % 6 == 3 or composites.has_key(q):
                q += 2 * factor
            composites[q] = factor
        else:
            # not there - prime
            composites[n*n] = n
            yield n
        n += step
        step = 6 - step

Now the time to generate primes less than 10⁷ is 11.9 seconds. Again, I think we've hit diminishing returns. We didn't get the 1/3 reduction I'd hoped, probably due to the more complicated "next multiple" calculation.

YAGNI

Things are going pretty well. There's only one thing that bothers me about the latest generator - we're storing too many composites in the associative array. Every time we find a prime number, its square is inserted in the array. Even 9999991² is put in the array, even though we'll never check to see if any number greater than 10⁷ is prime. So, modifying the algorithm to omit storing composites that are too large, we get:

def sixish_sieve_max():
    composites = {}
    yield 2
    yield 3
    step = 2
    n = 5
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + 2 * factor
            while q % 6 == 3 or composites.has_key(q):
                q += 2 * factor
            composites[q] = factor
        else:
            # not there - prime
            if n*n <= highest:
                composites[n*n] = n
            yield n
        n += step
        step = 6 - step

This generator takes 10.8 seconds to generate primes below 10⁷ - modest improvement, and one I'd keep anyhow, since the code is barely more complicated than the previous version. However, the real boost, if there is any, is in the memory usage. When sixish_sieve generates primes below 10⁷, the private bytes climb up to 52MB, but sixish_sieve_max stays below 25MB. The advantage continues as the problem set grows - when the upper limit is 2*10⁷, sixish_sieve takes 100MB, but sixish_sieve_max remains at a cool 25MB - I guess that's the difference between storing 1270607 composites and 607.

Conclusion

This was a fun and interesting exercise. Being able to look bad at your old code and say "Boy, that was horrible. I'm glad I'm smarter now," makes me happy. And embarrassed. I enjoyed seeing how applying incremental changes and even rudimentary profiling yielded provably better results, right up until they showed that I probably needed to abandon my current path (trial division) and jump on a new one.

I'm sticking with sixish_sieve_max for now. It's fast enough to meet my current needs, and will likely remain so until "CPU inflation" forces the Project Euler team to increase the size of their problem sets. Of course, maybe by then I'll have a faster processor and I won't care.