This time another colleague was trying to work with
gmpy2, a library for
fast arithmetic operations on very large numbers. The specific problem was that
the numbers first had to be created from a numpy array of 1,000,000 or more
booleans, and the method my colleague had come up with wasn’t fast enough.
To beat: `num = num.bit_set(i)`
As a first attempt, they created a `gmpy2.mpz` object and manually set each bit:

```python
import gmpy2

def mpz_by_bitset(bool_array):
    num = gmpy2.mpz(0)
    for i, bit in enumerate(bool_array[::-1]):
        if bit:
            num = num.bit_set(i)
    return num
```
Before considering performance on large arrays, let’s test it with a sample of 128 (=2^7) booleans:
```python
import numpy as np

bool_array = np.random.randint(0, 2, 2**7, dtype=bool)
%timeit mpz_by_bitset(bool_array)
```
10.8 µs ± 4.13 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
This function does what we want, but raises two red flags for me:
- We’re processing every bit individually
- A new copy of the object is made every time we change a bit
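The second red flag is easy to demonstrate. Plain Python ints stand in for `gmpy2.mpz` in this sketch, since both are immutable and behave the same way here: "setting a bit" always builds a brand-new number object.

```python
# Plain Python ints stand in for gmpy2.mpz: both are immutable,
# so every "bit set" creates a new object instead of mutating in place.
num = 0
updated = num | (1 << 3)  # set bit 3, analogous to mpz.bit_set(3)
print(updated)  # 8
print(num)      # still 0: the original object was not modified
```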
Modern processors work on 64 bits at a time, or possibly even more with vector instructions, so only working on a single bit at a time means we’re using only about 1-2% of our processor’s power.
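To make the word-at-a-time idea concrete, here is a small sketch with plain Python ints (not the colleague's actual code) that consumes eight bytes, i.e. 64 bits, per loop iteration instead of a single bit:

```python
# Consume 64 bits (8 bytes) per iteration instead of one bit at a time.
data = bytes(range(16))  # 16 bytes of example input

num = 0
for i in range(0, len(data), 8):
    num = (num << 64) | int.from_bytes(data[i:i + 8], byteorder='big')

# Same result as converting the whole buffer in one go:
assert num == int.from_bytes(data, byteorder='big')
```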
First Suggestion: making a binary string
In a first attempt to do better, we make use of the fact that Python’s integers
can be infinitely large, and can be created from a binary string. We first
manually create the binary string with `''.join()`, create a Python int, and then
pass that to `gmpy2.mpz`:

```python
def mpz_by_bin_string(bool_array):
    bin_string = '0b' + ''.join(map(str, bool_array.astype(int)))
    return gmpy2.mpz(int(bin_string, base=2))

%timeit mpz_by_bin_string(bool_array)
```
80.2 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
While this may seem more Pythonic, we’re now working with bits as a string instead of as numbers. This isn’t great for performance, and it shows: for our 128-bit example it’s about 8x slower.
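To see what the function is actually building, here is the intermediate string for a tiny input:

```python
import numpy as np

# The same steps as mpz_by_bin_string, on a 4-element array
bits = np.array([1, 0, 1, 1], dtype=bool)
bin_string = '0b' + ''.join(map(str, bits.astype(int)))
print(bin_string)          # 0b1011
print(int(bin_string, 2))  # 11
```

Every bit becomes a full Python string character before being parsed back into a number, which is where the overhead comes from.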
Second Suggestion: numpy.packbits
Our first suggestion points in the right direction, though: a number in a
computer is, after all, just a string of bits in memory. All we want to do is
take the string of bits we already have, our boolean array, and let gmpy2 read it
as a number it recognizes. Rather than going through a string as an
intermediate step, we should do this numerically. As `gmpy2` has a method
`from_binary`, this should be possible.
The problem is that a boolean in Python/numpy is stored as an integer, taking up at least one byte of space (‘00000000’ or ‘00000001’). Simply sticking those bytes one after another gives us a number with a lot of zeros in the middle that don’t belong there: we only care about the last bit of each byte.
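A quick look at the raw bytes of a small boolean array shows the wasted space:

```python
import numpy as np

bits = np.array([1, 0, 1, 1], dtype=bool)
print(bits.tobytes())  # b'\x01\x00\x01\x01' - one full byte per boolean
```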
That problem can be solved with `numpy.packbits`, which packs an array of
binary values into an array of type `uint8`, eight input bits per byte. That
gets rid of all the zeros we don’t want. Now we can use numpy’s `tobytes()` to
get the array directly as bytes.
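For a byte-sized example, `packbits` collapses eight booleans into a single `uint8`:

```python
import numpy as np

bits = np.array([1, 0, 1, 1, 0, 0, 0, 1], dtype=bool)
packed = np.packbits(bits)  # array([177], dtype=uint8): 0b10110001
print(packed.tobytes())     # b'\xb1'
print(int.from_bytes(packed.tobytes(), byteorder='big'))  # 177
```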
Sadly, that solution fails in practice. While `gmpy2.from_binary()` accepts a
bytestring as input, it fails when given our bit-packed array as bytes. There
seems to be some special formatting needed that isn’t documented. We need to
use a Python integer as an intermediate for it to work:
```python
def mpz_by_packbits(bool_array):
    bytestring = np.packbits(bool_array).tobytes()
    python_int = int.from_bytes(bytestring, byteorder='big')
    return gmpy2.mpz(python_int)

%timeit mpz_by_packbits(bool_array)
```
1.54 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Now that’s more like it! Even with the added step of going through a Python int
instead of directly using `gmpy2.from_binary`, we’re still over 6x faster. But
does that hold up for larger array sizes?
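As a quick sanity check that the two approaches agree, here is a sketch using plain Python ints in place of `mpz` (the comparison only needs integer equality, so gmpy2 itself isn't required):

```python
import numpy as np

rng = np.random.default_rng(0)
bool_array = rng.integers(0, 2, 128).astype(bool)  # a multiple of 8 bits

# Bit-by-bit reference, mirroring mpz_by_bitset with a plain Python int
ref = 0
for i, bit in enumerate(bool_array[::-1]):
    if bit:
        ref |= 1 << i

# packbits route, mirroring mpz_by_packbits
packed = int.from_bytes(np.packbits(bool_array).tobytes(), byteorder='big')
assert ref == packed
```

Note the array length is a multiple of 8 on purpose: `packbits` zero-pads incomplete bytes at the end, which would shift the resulting number.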
Automated Comparison with Perfplot
Perfplot is a cool little package that automates the creation of timing
comparison plots like those I manually made in earlier posts. Using it
we can quickly compare the different implementations and see how they scale. As
test input size, I will use only multiples of complete bytes, just to prevent
any possible issues with the padding of incomplete bytes:
```python
import perfplot

out = perfplot.bench(
    setup=lambda n: np.random.randint(0, 2, 2**n),
    kernels=[mpz_by_bitset, mpz_by_bin_string, mpz_by_packbits],
    n_range=[n for n in range(3, 36, 4)],
    labels=['mpz.bit_set', 'int(bin_string)', 'np.packbits'],
    xlabel='2^n booleans',
    equality_check=False,
    max_time=20,
)
out.show(logy=True, logx=False)
```
The most obvious thing to notice is that the `np.packbits` version is clearly
the fastest and scales best with larger sizes. It only takes about 10 seconds
for an array of 2^30 booleans. That’s almost 1 GigaByte of memory for just the
input array!
The `int(bin_string)` version scales about the same as `np.packbits`, but is
a lot slower overall. I can imagine the translation from integers to strings
and back into an integer taking up a lot of overhead.
The `bit_set` version doesn’t actually perform too badly up to ~2^17 booleans, or
about 16 KiloBytes. After that it scales a lot more poorly and does worse than
the `int(bin_string)` implementation. Perhaps the local copies that keep
getting made are overflowing the processor’s cache at that point? Either way, it
exceeds a runtime of 100 seconds when processing 2^23 booleans (= one MegaByte).
Iterating over the individual bits with `bit_set` is indeed slower than the
`np.packbits` approach, and scales surprisingly poorly beyond ~2^17 booleans.
Also: Perfplot is cool!