  1. What is the fastest (or most "Pythonic") way to convert

    x = [False, False, True, True]
    

    into 12? (If there is such a way.)

  2. What if x were instead a numpy.array of bools? Is there a special command for that?

I have a large m-by-n array of booleans, where each n-element row represents a single low-dimensional hash of a high-dimensional feature vector. (In the example above, n = 4.) I would like to know the answer in order to compress my data as much as possible. Thank you.


Edit: Thank you for the responses! Using the following test code,

t = 0
for iter in range(500):
    B = scipy.signbit(scipy.randn(1000,20))
    for b in B:
        t0 = time.clock()
        # test code here
        t1 = time.clock()
        t += (t1-t0)
print t

...here were the runtimes on my Thinkpad laptop:

Of course, I welcome any independent tests that may confirm or refute my data!
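For reference, a less error-prone way to take these measurements is Python's timeit module (as liori suggests in the comments below). A sketch, where the statement under test and the vector size are just examples:

```python
import timeit

setup = "x = [False, False, True, True] * 5  # a 20-bit vector"
stmt = "sum(1 << i for i, b in enumerate(x) if b)"

# repeat=5 runs of 100000 executions each; take the best (lowest) time
print(min(timeit.repeat(stmt, setup, repeat=5, number=100000)))
```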


Edit: In my answer below, changing int(j) to simply j still works, but runs six times slower! Perhaps the other answers would also become faster if the bool were cast with int, but I'm too lazy to test everything again.


Edit: liori posted results of independent tests here.

Steve Tjoa
  • What's the rule to convert the [False, False, True, True] into 12? –  Oct 31 '10 at 23:34
  • `x[0]` is the LSB, and `x[-1]` is the MSB. – Steve Tjoa Oct 31 '10 at 23:35
  • 2
    Please use `timeit` for testing, it is much less prone to errors. My times: http://pastebin.com/x1FEP9gY – liori Nov 01 '10 at 02:16
  • Thanks for the tests! I don't doubt them at all. I have added them to the post. – Steve Tjoa Nov 01 '10 at 02:31
  • Just something to note - in liori's test, sven2() fails miserably because we are using 1000-bit numbers. Check the results (as in the numbers returned by each function) and you'll see that its result is wrong for that large of a number. – Justin Peel Nov 01 '10 at 05:14
  • @Justin Peel - While you're quite correct in general, unless I'm missing something, the example is using 20-bit numbers, _not_ 1000-bit! There shouldn't be any problem with 20 columns on the numpy-based methods.... – Joe Kington Nov 01 '10 at 14:39
  • I think he means in the tests that liori posted. The vector length is 1000 and is tested over 1000 trials. – Steve Tjoa Nov 01 '10 at 14:45
  • @Steve - Ah, that makes more sense! Thanks, and sorry for the noise! – Joe Kington Nov 01 '10 at 17:01

10 Answers


Taking various ideas from various other answers, here's another way to do it:

sum(1<<i for i, b in enumerate(x) if b)

It is quite fast in my tests, right up there with the numpy method for large numbers of bits even though it overflows like crazy. I used liori's testing module for testing. Steve's method, with the change I suggested, is just barely faster. However, if a lot of these conversions need to be done at a time (and with not too many bits), I'm betting that numpy will be faster.
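
Applied to the example from the question (with x[0] as the least significant bit):

```python
x = [False, False, True, True]

# shift a 1 into position i for every True bit; bits 2 and 3 give 4 + 8
print(sum(1 << i for i, b in enumerate(x) if b))  # 12
```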

Justin Peel

Most Pythonic might be this:

sum(2**i*b for i, b in enumerate(x))

It's hard to tell if it is also the fastest.

In numpy I would use

numpy.sum(2**numpy.arange(len(x))*x)

but this won't be faster for small arrays x, and it won't work for big arrays x, since machine-size integers are used instead of Python's arbitrary-precision ints.

Sven Marnach
  • Thanks! For some array sizes, the second solution worked quite well, but for others it didn't. – Steve Tjoa Nov 01 '10 at 00:38
  • @Steve - The other advantage of the numpy solution is that you can avoid iterating through each row. Using the "`B`" array from your test code above: `numpy.sum(2**numpy.arange(B.shape[1])*B, axis=1)`. This should give a large speedup compared with iterating over each row in the array... The full 500x loop executes in less than a second on my machine... – Joe Kington Nov 01 '10 at 02:19
  • 1
    Since numpy doesn't handle large integers the same as Python, you have to be careful with really large numbers. If there will larger numbers, you can get a little more out of this method by doing `dtype=numpy.longlong` in arange(). Also, there is a very, very small speed-up by using the sum method of the resultant numpy array rather than using numpy.sum. – Justin Peel Nov 01 '10 at 05:17
reduce(lambda a,b:2*a+b, reversed(x))

You could get rid of reversed() if you had the least significant bit at the end of the array. This works with numpy.array too, and doesn't need enumerate(). From my tests it seems to be faster too: no need to use exponentiation.
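
For reference, in Python 3 reduce has moved to functools, so the same approach would read:

```python
from functools import reduce

x = [False, False, True, True]

# Horner's rule: start from the most significant bit, doubling the
# accumulator and adding each bit in turn
value = reduce(lambda a, b: 2 * a + b, reversed(x), 0)
print(value)  # 12
```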

liori
  • Thank you for the elegant solution! I was blown away when I first saw it. Unfortunately, it seems to run the slowest, with or without `reversed`. Might anyone know why? – Steve Tjoa Nov 01 '10 at 00:33
  • @Steve: on my computer it is faster than sum+exponentiation. Funny thing... how long are the vectors you use? Do you test performance using `timeit`? – liori Nov 01 '10 at 01:27

An elegant, Pythonic way that always works:

def powers(x):
    """yield powers of x, starting from x**0 forever"""
    power = 1
    while True:
        yield power
        power *= x

def bools_to_int(bools):
    # in Python 2, use itertools.izip!
    return sum(int(place) * place_weight for place_weight, place in 
               zip(powers(2), bools))

Note that you can get rid of powers (by using enumerate and exponentiation in the comprehension, as other answers do), but maybe it's clearer this way.

  • Your answer doesn't give the same answer as the others. Substituting `bools` for `reversed(bools)` fixes it. – Justin Peel Nov 01 '10 at 05:07
  • @Justin Peel: Come again? I already noticed that shortly after answering and added `reversed`... –  Nov 01 '10 at 12:53
  • try the code that you have here with the example given by the OP. I get 3 as the answer when it should be 12. You didn't need to put the `reversed` in. – Justin Peel Nov 01 '10 at 15:55
  • *bashes head against wall* @Justin: Yes, you're right and now I realize why. –  Nov 01 '10 at 16:23

My initial attempt, just for reference:

def bool2int(x):
    y = 0
    for i,j in enumerate(x):
        if j: y += int(j)<<i
    return y
Steve Tjoa

I was trying IPython's %timeit, and it seems that the following is faster:

y = 0
for i,j in enumerate(x):
    if j: y += 1<<i

In addition, if your boolean vector is a numpy.ndarray, converting it to a Python list with x.tolist() and running the same code seems to be faster. It's all marginal but consistent, and at these speeds the margins add up.
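
A sketch of the tolist() variant described above:

```python
import numpy as np

x = np.array([False, False, True, True])

# iterate over plain Python bools instead of numpy scalars
y = 0
for i, j in enumerate(x.tolist()):
    if j:
        y += 1 << i
print(y)  # 12
```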

Atreya

numpy has the packbits function for this. It also supports operations along axes:

In [3]: B = scipy.signbit(scipy.randn(1000,8)).astype("i1")

In [3]: B[0]
Out[3]: array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [4]: np.packbits(B[0])
Out[4]: array([68], dtype=uint8)

In [5]: %timeit np.packbits(B, axis=1)
10000 loops, best of 3: 37 µs per loop

It works on 8-bit chunks; for larger sizes you have to shift and OR the resulting bytes together:

In [8]: x # multiple of 8
Out[8]: array([1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1], dtype=int8)

In [9]: r = np.packbits(x).astype(np.int32); r
Out[9]: array([129, 213], dtype=int32)

In [10]: r[0] << 8 | r[1] 
Out[10]: 33237

In [11]: sum(1<<i for i, b in enumerate(x[::-1]) if b)
Out[11]: 33237

If the length of x is not a multiple of 8, you have to pad it with zeros.
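
One way to handle that padding for an arbitrary width might be the following sketch (the helper name is made up here; note this answer's bit order is most-significant-first, the reverse of the question's convention):

```python
import numpy as np

def bits_to_int(bits):
    """Pack an MSB-first bit sequence into a single Python int."""
    bits = np.asarray(bits, dtype=np.uint8)
    packed = np.packbits(bits)  # zero-pads the final byte on the right
    value = 0
    for byte in packed:
        value = (value << 8) | int(byte)
    # drop the zero bits packbits appended to the last byte
    return value >> ((-len(bits)) % 8)

x = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1]
print(bits_to_int(x))  # 33237
```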

jtaylor

Something like this?

>>> x = [False, False, True, True]
>>> sum([int(y[1])*2**y[0] for y in enumerate(x)])
12

You can convert a numpy array to a regular list using a list() cast.

>>> a = numpy.array([1,2,3,4])
>>> a
array([1, 2, 3, 4])
>>> list(a)
[1, 2, 3, 4]
Emil H

If you have a matrix, you probably want to do it like this:

#precompute powers of two
vals = 2.**np.arange(20)

B = ....
compressed = np.dot(B, vals) # matrix multiplication.

np.dot should be faster than any loop in Python. Much faster.
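
A runnable version of that idea on the question's example (using integer weights here, since float powers of two lose exactness past 2**53):

```python
import numpy as np

B = np.array([[False, False, True, True],
              [True, False, False, False]])

# column i carries weight 2**i, so column 0 is the least significant bit
vals = 2 ** np.arange(B.shape[1], dtype=np.int64)
compressed = B.dot(vals)  # one integer per row
print(compressed)  # [12  1]
```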

luispedro

If you're willing to add another extension to the mix, I added pack() and unpack() to the development branch of gmpy. My tests show it may be 2x or 3x faster.

>>> import gmpy2
>>> gmpy2.pack([0,0,1,1],1)
mpz(12)
>>> gmpy2.unpack(12,1)
[mpz(0), mpz(0), mpz(1), mpz(1)]

Disclaimer: The development version is called gmpy2 and can co-exist with the stable version. It is still in alpha phase but will hopefully become beta in a few weeks. You need to have both GMP and MPFR libraries installed. The source is available at http://code.google.com/p/gmpy/source/checkout

casevh