This time another colleague was trying to work with gmpy2, a library for fast arithmetic operations on very large numbers. The specific problem was that the numbers first had to be created from a numpy array of 1,000,000 or more booleans, and the method my colleague had come up with wasn't fast enough.

To beat: num = num.bit_set(i)

As a first attempt, they created a gmpy2.mpz object and manually set each bit with the bit_set() method.

import gmpy2

def mpz_by_bitset(bool_array):
    num = gmpy2.mpz(0)
    # Reverse the array so its last element becomes bit 0 (the least significant bit)
    for i, bit in enumerate(bool_array[::-1]):
        if bit:
            # mpz objects are immutable: bit_set returns a new copy with bit i set
            num = num.bit_set(i)
    return num

Before considering performance on large arrays, let’s test it with a sample of 128 (=2^7) booleans:

import numpy as np
bool_array = np.random.randint(0,2,2**7, dtype=bool)
%timeit mpz_by_bitset(bool_array)

10.8 µs ± 4.13 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

This function does what we want, but raises two red flags for me:

  • We’re processing every bit individually
  • A new copy of the object is made every time we change a bit

Modern processors work on 64 bits at a time, or possibly even more with vector instructions, so working on a single bit at a time means we're using roughly 1/64th, or about 1-2%, of our processor's power.

First Suggestion: making a binary string

In a first attempt to do better, we make use of the fact that Python's integers can be arbitrarily large and can be created from a binary string. We first manually build the binary string with ''.join(), create a Python int from it, and then pass that to gmpy2:

def mpz_by_bin_string(bool_array):
    # Build a string like '0b10110...' and let int() parse it as base 2
    bin_string = '0b' + ''.join(map(str, bool_array.astype(int)))
    return gmpy2.mpz(int(bin_string, base=2))
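
To make the intermediate representation concrete, here is a tiny illustration of my own (the values are not from the original post):

''.join(map(str, np.array([True, False, True]).astype(int)))  # '101'
int('0b101', base=2)                                          # 5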
    
%timeit mpz_by_bin_string(bool_array)

80.2 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

While this may seem more Pythonic, we're now handling the bits as characters in a string instead of as numbers. That isn't great for performance, and it shows: for our 128-bit example it's roughly 7.5x slower.

Second Suggestion: numpy.packbits

Our first suggestion does point in the right direction, though: a number in a computer is, after all, just a string of bits in memory. All we want to do is take the string of bits we already have, our boolean array, and let gmpy2 read it as a number it recognizes. So rather than going through a string as an intermediate step, we should do this numerically. Since gmpy2 has a from_binary method, this should be possible.

The problem is that a boolean in numpy is stored as a full byte ('00000000' or '00000001'). Simply sticking those bytes behind each other would give us a number with a lot of zeros in the middle that don't belong there: we only care about the last bit of each byte.
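
A quick illustration of that one-byte-per-boolean layout (a snippet of my own, not from the original post):

# Each numpy boolean occupies a full byte in memory
np.array([True, False, True], dtype=bool).tobytes()
# b'\x01\x00\x01': one byte per boolean, only the last bit carries information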

That problem can be solved with numpy.packbits, which packs the bits of a binary-valued array into an array of type uint8, eight input values per byte. That gets rid of all the zeros we don't want. Now we can use numpy's tobytes() to get the packed array directly as bytes.
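
For example (a small sketch of my own to show what this does):

bits = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)
np.packbits(bits)            # array([170], dtype=uint8), since 0b10101010 == 170
np.packbits(bits).tobytes()  # b'\xaa'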

Sadly, that solution fails in practice. While gmpy2.from_binary() accepts a bytestring as input, it fails when given our bitpacked array as bytes: from_binary expects gmpy2's own serialization format, as produced by its to_binary() counterpart, rather than raw bytes. We need to use a Python integer as an intermediate for it to work:

def mpz_by_packbits(bool_array):
    # Pack eight booleans into each byte, then dump the uint8 array as raw bytes
    bytestring = np.packbits(bool_array).tobytes()
    # Interpret those bytes as a single big-endian integer
    python_int = int.from_bytes(bytestring, byteorder='big')
    return gmpy2.mpz(python_int)
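
As a quick sanity check (my addition, not in the original post): all three implementations should agree, at least for lengths that are a multiple of eight. For other lengths, np.packbits pads the final byte with zeros on the least significant side, which would shift the packbits result.

assert mpz_by_bitset(bool_array) == mpz_by_bin_string(bool_array) == mpz_by_packbits(bool_array)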

%timeit mpz_by_packbits(bool_array)

1.54 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Now that's more like it! Even with the added step of going through int.from_bytes instead of using gmpy2.from_binary directly, we're still about 7x faster than the bit_set version. But does that hold up for larger array sizes?

Automated Comparison with Perfplot

Perfplot is a cool little package that automates the creation of timing comparison plots like those I made manually in earlier posts. Using it we can quickly compare the different implementations and see how they scale. As test input sizes I will only use multiples of complete bytes, to avoid the padding issue with numpy.packbits mentioned above.

import perfplot

out = perfplot.bench(
    setup=lambda n: np.random.randint(0, 2, 2**n, dtype=bool),
    kernels=[mpz_by_bitset, mpz_by_bin_string, mpz_by_packbits],
    n_range=[n for n in range(3, 36, 4)],
    labels=['mpz.bit_set', 'int(bin_string)', 'np.packbits'],
    xlabel='2^n booleans',
    equality_check=False,
    max_time=20,
)

out.show(logy=True, logx=False)

The most obvious thing to notice is that the np.packbits version is clearly the fastest and scales best at larger sizes. It takes only about 10 seconds for an array of 2^30 booleans, an input that by itself occupies a full gigabyte of memory at one byte per boolean!

The int(bin_string) version scales about the same as np.packbits, but is a lot slower overall. I can imagine the round trip from integers to a string of characters and back into an integer adding a lot of overhead.

The bit_set version doesn't actually perform too badly up to ~2^17 booleans, which pack down to about 16 kilobytes. After that it scales a lot more poorly and does worse than even the int(bin_string) implementation. Since bit_set returns a fresh copy of the entire number for every bit we set, the total work grows quadratically with the number of bits; on top of that, those copies may be overflowing the processor's cache at that point. Either way, it exceeds a runtime of 100 seconds when processing 2^23 booleans (1 megabyte packed).

Conclusion

Iterating over the individual bits and using bit_set is indeed slower than the implementation using np.packbits, and scales surprisingly poorly for more than 2^17 booleans.

Also: Perfplot is cool!