
I'm trying to convert from a numpy array of signs (i.e., a numpy array whose entries are either 1. or -1.) to an integer and back through a binary representation. I have something that works, but it's not Pythonic, and I expect it'll be slow.

def sign2int(s):
    s[s==-1.] = 0.
    bstr = ''
    for i in range(len(s)):
        bstr = bstr + str(int(s[i]))
    return int(bstr, 2)

def int2sign(i, m):
    bstr = bin(i)[2:].zfill(m)
    s = []
    for d in bstr:
        s.append(float(d))
    s = np.array(s)
    s[s==0.] = -1.
    return s

Then

>>> m = 4
>>> s0 = np.array([1., -1., 1., 1.])
>>> i = sign2int(s0)
>>> print i
11
>>> s = int2sign(i, m)
>>> print s
[ 1. -1.  1.  1.]

I'm concerned about (1) the for loop in each function and (2) having to build an intermediate string representation.

Ultimately, I will want something that works with a 2-d numpy array, too---e.g.,

>>> s = np.array([[1., -1., 1.], [1., 1., 1.]])
>>> print sign2int(s)
[5, 7]
user1416125
  • Have you tried it on a *real* dataset? How large is it? – wwii Aug 27 '16 at 00:57
  • The largest datasets I expect to see will have sign arrays with ~1000 elements, but the number of sign arrays may be in the billions---a very tall matrix. @wwii – user1416125 Aug 27 '16 at 01:13
  • Now that you mention it, I believe this will only work if the sign arrays have at most 64 elements. @wwii – user1416125 Aug 27 '16 at 01:16
  • In other words - ~1000 bit integers? – wwii Aug 27 '16 at 01:23
  • Right. :) There will be many data sets with fewer than 100 elements per sign vector. But I might try to use multiple 64-bit ints. The answers below have already helped a lot. Thanks @wwii! – user1416125 Aug 27 '16 at 01:26
  • One option to convert from sign to binary is to scale and shift i.e. if `signArray` is a numpy array of signs then `(signArray+1)/2` will be an array of binary digits. – kabdulla Aug 27 '16 at 01:53
  • Be careful with your implementation of `sign2int`. It modifies the argument in-place, so after calling `i = sign2int(s0)`, `s0` has been modified--all -1 values have been changed to 0. – Warren Weckesser Aug 27 '16 at 01:56
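
One way to avoid that side effect is to copy the input inside the function - a minimal sketch of the fix, applied to the original sign2int:

def sign2int(s):
    s = s.copy()    # work on a copy so the caller's array is untouched
    s[s == -1.] = 0.
    return int(''.join(str(int(x)) for x in s), 2)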

5 Answers


For 1-d arrays you can use this Numpythonic one-liner, using np.packbits:

>>> np.packbits(np.pad((s0+1).astype(bool).astype(int), (8-s0.size, 0), 'constant'))
array([11], dtype=uint8)

And for reversing:

>>> unpack = (np.unpackbits(np.array([11], dtype=np.uint8))[-4:]).astype(float)
>>> unpack[unpack==0] = -1
>>> unpack
array([ 1., -1.,  1.,  1.])

And for a 2-d array:

>>> x, y = s.shape
>>> np.packbits(np.pad((s+1).astype(bool).astype(int), (8-y, 0), 'constant')[-x:])
array([5, 7], dtype=uint8)

And for reversing:

>>> unpack = (np.unpackbits(np.array([5, 7], dtype='uint8'))).astype(float).reshape(x, 8)[:,-y:]
>>> unpack[unpack==0] = -1
>>> unpack
array([[ 1., -1.,  1.],
       [ 1.,  1.,  1.]])
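
The same idea can be wrapped into a pair of general helpers that pad any 1-d length to a whole number of bytes - a minimal sketch, with made-up names sign2int_packbits/int2sign_packbits:

import numpy as np

def sign2int_packbits(s):
    # +1 -> 1, -1 -> 0, then left-pad to a whole number of bytes
    bits = (s > 0).astype(np.uint8)
    return np.packbits(np.pad(bits, ((-bits.size) % 8, 0), 'constant'))

def int2sign_packbits(packed, m):
    # unpack, keep the last m bits, and map 1 -> +1., 0 -> -1.
    bits = np.unpackbits(packed)[-m:].astype(float)
    return 2 * bits - 1

Round-tripping the example gives back the original: int2sign_packbits(sign2int_packbits(s0), s0.size) is array([ 1., -1.,  1.,  1.]).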
Mazdak
  • Thanks, @Kasramvd! I was not aware of packbits and unpackbits. Seems very useful. – user1416125 Aug 27 '16 at 01:19
  • re: [your deleted answer on another question (with benchmarks for list swapping)](http://stackoverflow.com/a/39168029/224132). You should undelete it, and turn it into an answer about the benchmark. It's interesting that your idea ran so much slower, and that one of them ran so much faster. IDK a lot about python, but it seems like other similar list-manipulation problems might have similar choices of solutions with similar relative performance, so it's potentially a useful thing to point out. – Peter Cordes Aug 27 '16 at 03:37
  • @PeterCordes Actually, by that time I was pretty tired (~20h awake) and I just wanted to get rid of that terrible mistake. Now I think there's no need for my solution anyway, because it's pretty obvious that my code will take longer than the rest. I don't know why on earth I used 2 `for` loops (which means more unpacking, complexity, calls, stack jobs, etc.) when I'd written in my answer that the other ones use multiple iterations, slicing, etc. Now I think the only interesting thing might be a benchmark between those three answers. – Mazdak Aug 27 '16 at 12:56
  • Can you use those two numpy methods to work with integers that have more than eight bits? – wwii Aug 27 '16 at 21:04
  • @wwii In that case your answer would work very well! – Mazdak Aug 27 '16 at 22:02
  • I was trying to figure out if you could parcel up a bigger number into smaller chunks and operate on them with packbits and vectorize the whole thing.. – wwii Aug 27 '16 at 22:20
  • @wwii Yeah, I thought about that, but it doesn't seem optimized. Using `array.view()` to create a view of the array with another type might also help, but we'd still need some extra calculations, and it won't be as fast as the mathematical approach that you've used. – Mazdak Aug 27 '16 at 22:29

I'll start with sign2int. Convert from a sign representation to binary:

>>> a
array([ 1., -1.,  1., -1.])
>>> (a + 1) / 2
array([ 1.,  0.,  1.,  0.])
>>> 

Then you can simply create an array of powers of two, multiply it by the binary digits, and sum.

>>> powers = np.arange(a.shape[-1])[::-1]
>>> np.power(2, powers)
array([8, 4, 2, 1])
>>> a = (a + 1) / 2
>>> powers = np.power(2, powers)
>>> a * powers
array([ 8.,  0.,  2.,  0.])
>>> np.sum(a * powers)
10.0
>>> 

Then make it operate on rows by adding axis information and relying on broadcasting:

def sign2int(a):
    # note: operates on `a` in place to avoid temporaries
    # powers of two
    powers = np.arange(a.shape[-1])[::-1]
    np.power(2, powers, powers)
    # sign to "binary" - add one and divide by two
    np.add(a, 1, a)
    np.divide(a, 2, a)
    # scale by powers of two and sum
    np.multiply(a, powers, a)
    return np.sum(a, axis = -1)
>>> a = np.array([ 1., -1.,  1.,  1.])
>>> b = np.array([a, a, a, a, a])
>>> sign2int(b)
array([ 11.,  11.,  11.,  11.,  11.])
>>> 

I tried it on five rows of 400 bits (the 4-element array repeated 100 times) and it seemed fast:

>>> a = a.repeat(100)
>>> b = np.array([a, a, a, a, a])
>>> b
array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
>>> sign2int(b)
array([  2.58224988e+120,   2.58224988e+120,   2.58224988e+120,
         2.58224988e+120,   2.58224988e+120])
>>> 

I'll add the reverse if I can figure it out. The best I could do relies on some plain Python without any numpy vectorization magic, and I haven't figured out how to make it work with a sequence of ints other than to iterate over them and convert them one at a time - but the time still seems acceptable.

def foo(n):
    '''yields bits in increasing powers of two

    bit sequence from lsb --> msb
    '''
    while n > 0:
        n, r = divmod(n, 2)
        yield r

def int2sign(n):
    n = int(n)
    a = np.fromiter(foo(n), dtype = np.int8, count = n.bit_length())
    np.multiply(a, 2, a)
    np.subtract(a, 1, a)
    return a[::-1]

Works on 1324:

>>> bin(1324)
'0b10100101100'
>>> a = int2sign(1324)
>>> a
array([ 1, -1,  1, -1, -1,  1, -1,  1,  1, -1, -1], dtype=int8)

Seems to work with 1.2e305:

>>> n = int(1.2e305)
>>> n.bit_length()
1014
>>> a = int2sign(n)
>>> a.shape
(1014,)

>>> s = bin(n)
>>> s = s[2:]
>>> all(2 * int(x) -1 == y for x, y in zip(s, a))
True
>>>
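
For integers that fit in 64 bits, the reverse can also be vectorized over a whole array of ints with right-shifts - a sketch (not from the original answer), assuming non-negative inputs and a known width m (the name int2sign_shift is made up):

def int2sign_shift(n, m):
    # shift each integer right by every bit position (msb first) and mask off the low bit
    bits = (np.asarray(n, dtype=np.int64)[..., None] >> np.arange(m - 1, -1, -1)) & 1
    return (2 * bits - 1).astype(float)    # map 1 -> +1., 0 -> -1.

For example, int2sign_shift(np.array([5, 7]), 3) returns [[ 1., -1.,  1.], [ 1.,  1.,  1.]].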
wwii

Here are some vectorized versions of your functions:

def sign2int(s):
    return int(''.join(np.where(s == -1., 0, s).astype(int).astype(str)), 2)

def int2sign(i, m):
    tmp = np.array(list(bin(i)[2:].zfill(m)))
    return np.where(tmp == "0", "-1", tmp).astype(int)

s0 = np.array([1., -1., 1., 1.])

sign2int(s0)
# 11

int2sign(11, 5)
# array([-1,  1, -1,  1,  1])

To use these functions on 2-d arrays, you can use the map function:

s = np.array([[1., -1., 1.], [1., 1., 1.]])

map(sign2int, s)
# [5, 7]

map(lambda x: int2sign(x, 4), [5, 7])
# [array([-1,  1, -1,  1]), array([-1,  1,  1,  1])]
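
(Note: in Python 3, map returns an iterator rather than a list, so you would wrap these calls in list(...) to get the output shown.)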
Psidom

After a bit of testing, the Numpythonic approach of @wwii that doesn't use strings seems to fit what I need best. For int2sign, I used a for-loop over the exponents with a standard conversion algorithm---which takes at most 64 iterations for 64-bit integers. Numpy's broadcasting then handles all the integers at each iteration very efficiently.

packbits and unpackbits are restricted to 8-bit integers; otherwise, I suspect that would've been the best (though I didn't try).

Here are the specific implementations I tested that follow the suggestions in the other answers (thanks to everyone!):

def _sign2int_str(s):
    return int(''.join(np.where(s == -1., 0, s).astype(int).astype(str)), 2)

def sign2int_str(s):
    return np.array(map(_sign2int_str, s))

def _int2sign_str(i, m):
    tmp = np.array(list(bin(i)[2:])).astype(int)
    return np.pad(np.where(tmp == 0, -1, tmp), (m - len(tmp), 0), "constant", constant_values = -1)

def int2sign_str(i,m):
    return np.array(map(lambda x: _int2sign_str(x, m), i.astype(int).tolist())).transpose()

def sign2int_np(s):
    p = np.arange(s.shape[-1])[::-1]
    s = (s + 1) / 2     # signs to binary digits
    return np.sum(s * np.power(2, p), axis = -1).astype(int)

def int2sign_np(i,m):
    N = i.shape[-1]
    S = np.zeros((m, N))
    for k in range(m):
        b = np.power(2, m - 1 - k).astype(int)
        S[k,:] = np.divide(i.astype(int), b).astype(float)
        i = np.mod(i, b)        
    S[S==0.] = -1.
    return S

And here is my test:

X = np.sign(np.random.normal(size=(5000, 20)))
N = 100

t = time.time()
for i in range(N):
    S = sign2int_np(X)
print 'sign2int_np: \t{:10.8f} sec'.format((time.time() - t)/N)

t = time.time()
for i in range(N):
    S = sign2int_str(X)
print 'sign2int_str: \t{:10.8f} sec'.format((time.time() - t)/N)

m = 20
S = np.random.randint(0, high=np.power(2,m), size=(5000,))

t = time.time()
for i in range(N):
    X = int2sign_np(S, m)
print 'int2sign_np: \t{:10.8f} sec'.format((time.time() - t)/N)

t = time.time()
for i in range(N):
    X = int2sign_str(S, m)
print 'int2sign_str: \t{:10.8f} sec'.format((time.time() - t)/N)

This produced the following results:

sign2int_np:    0.00165325 sec
sign2int_str:   0.04121902 sec
int2sign_np:    0.00318024 sec
int2sign_str:   0.24846984 sec
user1416125

I think numpy.packbits is worth another look. Given a real-valued sign array a, you can use numpy.packbits(a > 0). Decompression is done by numpy.unpackbits. This implicitly flattens multi-dimensional arrays, so you'll need to reshape after unpackbits if you have a multi-dimensional array.

Note that you can combine bit packing with conventional compression (e.g., zlib or lzma). If there is a pattern or bias to your data, you may get a useful compression factor, but for unbiased random data, you'll typically see a moderate size increase.
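
A minimal sketch of combining the two, assuming a is a 1-d array of +/-1 signs:

import zlib
import numpy as np

a = np.sign(np.random.normal(size=10000))    # random +/-1 signs
packed = np.packbits(a > 0).tobytes()        # 8 signs per byte
compressed = zlib.compress(packed)           # often slightly *larger* for unbiased random signs

bits = np.unpackbits(np.frombuffer(zlib.decompress(compressed), np.uint8))
restored = 2 * bits[:a.size].astype(float) - 1    # back to +/-1 signs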

Jed
  • Thanks, @Jed! To give some context, I'm estimating a probability mass function on corners of a hypercube through sampling, where each sample is a corner---and each corner can be represented as a bitstring. My current approach is first to convert each sample to an int, and then call `numpy.unique` on the set of integers to get counts. In this context, `packbits` is limited by `uint8` (as far as I can tell). So I couldn't go higher than an 8-d cube without careful parsing of `packbits` output. The simple division algorithm gets me to 64-bit ints, which is enough for now. – user1416125 Aug 27 '16 at 20:24
  • @user1416125 Use `numpy.unique` on tuples for each row, cf. http://stackoverflow.com/questions/31097247/remove-duplicate-rows-of-a-numpy-array so you don't have a silly limitation to 64 dimensions. It's also possible to check uniqueness (probabilistically) using a hash or bloom filter. – Jed Aug 27 '16 at 21:14