Fastest way to parse (split) binary bits in python

Question

We are counting photons and time-tagging them with an FPGA counter. We get about 500 MB of data per minute. The data are ~~in hex string~~ 32-bit signed integers stored using little-endian byte order. Currently I am doing this:

def getall(file):
    data1 = np.memmap(file, dtype='<i4', mode='r')

    d0=0
    raw_counts=[]
    for i in data1:

        binary = bin(i)[2:].zfill(8)
        decimal = int(binary[5:],2)

        if binary[:1] == '1':
            raw_counts.append(decimal)

    counter=collections.Counter(raw_counts)
    sorted_counts=sorted(counter.items(), key=lambda pair: pair[0], reverse=False)
    return counter,counter.keys(),counter.values()

~~I think this part (binary = bin(i)[2:].zfill(8);decimal = int(binary[5:],2)) is slowing down the process.~~ (No, it is not. I found that out by profiling my program.) Is there any way to speed it up? So far I only need the binary digits from [5:], not all 32 bits, so I think reducing each 32-bit value to its last 27 bits is taking much of the time. Thanks.
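For reference, slicing a 32-character binary string at [5:] corresponds to a single bitwise AND. The sketch below assumes the intent is to keep the 27 least-significant bits of each 32-bit word; the helper names `low27_via_string` and `low27_via_mask` are made up for illustration and do not appear in the original code.

```python
MASK_27 = 0x7FFFFFF  # 27 ones: the low 27 bits of a 32-bit word


def low27_via_string(i):
    """Slice a 32-character binary string at [5:], as the loop intends."""
    # & 0xFFFFFFFF maps a negative int to its 32-bit two's-complement pattern,
    # so the string always has exactly 32 characters and no '-' sign.
    binary = format(i & 0xFFFFFFFF, "032b")
    return int(binary[5:], 2)


def low27_via_mask(i):
    """Same result using one bitwise AND, with no string conversion."""
    return i & MASK_27


# Both approaches agree for positive and negative 32-bit values.
for value in (5, -5, 0x7FFFFFFF, -(2**31)):
    assert low27_via_string(value) == low27_via_mask(value)
```

Note that `bin()` on a negative integer returns a string starting with `-0b`, so the signed values need the `& 0xFFFFFFFF` step before any string slicing gives a meaningful 32-bit pattern.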

Update 1

J.F. Sebastian pointed out that the input is not a hex string.

Update 2

Here is the final code in case anyone needs it. I ended up using np.unique instead of collections.Counter. At the end, I convert the result back to a collections.Counter because I want cumulative counts.

#http://stackoverflow.com/questions/10741346/numpy-most-efficient-frequency-counts-for-unique-values-in-an-array
def myc(x):
    unique, counts = np.unique(x, return_counts=True)
    return np.asarray((unique, counts)).T


def getallfast(file):
    data1 = np.memmap(file, dtype='<i4', mode='r')
    data2=data1[np.nonzero((~data1 & (31 <<1)))] & 0x7ffffff #See J.F.Sebastian's comment.
    counter=myc(data2)
    raw_counts=dict(zip(counter[:,0],counter[:,1]))
    counter=collections.Counter(raw_counts)

    return counter,counter.keys(),counter.values()
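A minimal sketch of the filter-then-count idea with cumulative counting, under the assumption (taken from the `data1 & (1 << 31)` comment in this thread) that bit 31 is the flag marking words to keep. The function name `count_flagged` and the synthetic arrays are hypothetical stand-ins for the memory-mapped files.

```python
import collections

import numpy as np


def count_flagged(data):
    """Count the low 27 bits of every word whose top bit (bit 31) is set."""
    data = np.asarray(data, dtype="<i4")
    # For signed 32-bit integers, "bit 31 is set" is the same as "value < 0".
    flagged = data[data < 0]
    values = flagged & 0x7FFFFFF  # keep the 27 least-significant bits
    unique, counts = np.unique(values, return_counts=True)
    # Convert NumPy scalars to plain ints so the Counter has ordinary keys.
    return collections.Counter(dict(zip(unique.tolist(), counts.tolist())))


# Synthetic stand-ins for two memory-mapped files (assumed data).
chunk_a = np.array([-2147483643, -2147483643, 7, -2147483647], dtype="<i4")
chunk_b = np.array([-2147483643, 3], dtype="<i4")

total = collections.Counter()
for chunk in (chunk_a, chunk_b):
    total.update(count_flagged(chunk))  # Counter.update adds counts

print(sorted(total.items()))  # [(1, 1), (5, 3)]
```

Testing `data < 0` avoids combining an int32 array with the Python integer `1 << 31`, which does not fit in a signed 32-bit value.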

However, the following version turned out to be the fastest for me. Filtering first with `data1[np.nonzero((~data1 & (31 <<1)))] & 0x7ffffff` is slower than counting the unique values first and converting them afterwards with `binary = bin(counter[i,0])[2:].zfill(8)`.

def myc(x):
    unique, counts = np.unique(x, return_counts=True)
    return np.asarray((unique, counts)).T

def getallfast(file):
    data1 = np.memmap(file, dtype='<i4', mode='r')
    counter=myc(data1)
    xnew=[]
    ynew=[]
    raw_counts=dict()
    for i in range(len(counter)):
        binary = bin(counter[i,0])[2:].zfill(8)
        decimal = int(binary[5:],2)
        xnew.append(decimal)
        ynew.append(counter[i,1])
        raw_counts[decimal]=counter[i,1]


    counter=collections.Counter(raw_counts)
    return counter,xnew,ynew
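The count-first approach can also be kept fully vectorized. One thing to watch: in the loop above, `raw_counts[decimal]=counter[i,1]` overwrites an earlier entry when two different raw words share the same low 27 bits. The sketch below sums those counts instead. It is an illustration under assumptions, not the original method: the name `count_then_mask` is hypothetical, and like the loop it applies no flag-bit filter.

```python
import numpy as np


def count_then_mask(data):
    """Count distinct raw words first, then mask only the distinct values."""
    raw_values, raw_counts = np.unique(np.asarray(data), return_counts=True)
    masked = raw_values & 0x7FFFFFF  # low 27 bits of each distinct word
    # Different raw words can share the same low 27 bits, so merge them
    # by summing their counts instead of letting the last one win.
    keys, inverse = np.unique(masked, return_inverse=True)
    merged = np.bincount(inverse.ravel(), weights=raw_counts, minlength=len(keys))
    return keys, merged.astype(np.int64)


# Synthetic data (assumed): 5 and 5 + 2**27 differ only above bit 26.
sample = np.array([5, 5, 5 + 2**27, 9], dtype="<i4")
keys, counts = count_then_mask(sample)
print(dict(zip(keys.tolist(), counts.tolist())))  # {5: 3, 9: 1}
```

Because the mask is applied only to the distinct values, the expensive pass over the full array is the single `np.unique` call.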

actually from what I have found converting it to a string is quite performant ... moreso than other methods ... (at least when taking multiple slices) — Joran Beasley, Oct 09 '15 at 18:25
your code implies that the input is not "hex string". Your input contains 32-bit signed integers stored using little-endian byte order. To get the 27 least-significant bits, you could use bitwise operations: `i & 0x7ffffff` (to do it efficiently, use vectorized numpy operations). If you are doing everything right then your task should be I/O bound (limited by the speed of your hard disk where the input files are stored). [`Counter()` is slow on Python 2](http://stackoverflow.com/a/2525617/4279). — jfs, Oct 09 '15 at 21:14
Here's an [example of vectorized bitwise numpy operations](http://stackoverflow.com/a/15916760/4279) — jfs, Oct 09 '15 at 21:22
@J.F.Sebastian You are right. My input is 32-bit signed integers stored using little-endian byte order. I will take a look into vectorized numpy. Thanks — Aung, Oct 09 '15 at 22:35
vectorized numpy operations could be as simple as: `raw_counts = data1[np.nonzero(data1 & (1 << 31))] & 0x7ffffff` — jfs, Oct 15 '15 at 08:00
@J.F.Sebastian Thanks. I also use np.unique which is way faster than collection counter. I updated the code. — Aung, Oct 16 '15 at 00:11

score 0 · Answer 1 · answered Oct 09 '15 at 18:32

I guess you could try one of these two approaches.

You could just take the low five bits with a bitwise AND: fivebits = my_int & 0x1f

If you want the five bits at the other end, shift instead: fivebits = my_int >> (32-5)

But really, in my experience converting it to a string is quite fast. I thought that was a bottleneck many years ago, but after profiling it I found it wasn't.
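A small sketch of the two bitwise operations mentioned above, generalized to n bits. The helper names `low_bits` and `high_bits` are made up for illustration. Since the input here is signed, the high-bit version masks to 32 bits first; a plain right shift of a negative Python integer would give a negative result.

```python
def low_bits(value, n):
    """Return the n least-significant bits of a 32-bit word."""
    return value & ((1 << n) - 1)


def high_bits(value, n):
    """Return the n most-significant bits of a 32-bit word."""
    # Masking to 32 bits first makes the shift behave as a logical shift,
    # so negative inputs do not produce a negative result.
    return (value & 0xFFFFFFFF) >> (32 - n)


word = -2147483643  # bit pattern 0x80000005 as a signed 32-bit integer
print(low_bits(word, 5))   # 5
print(high_bits(word, 5))  # 16
print(low_bits(word, 27))  # 5
```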

Joran Beasley

it looks like OP wants `(32-5)` bits i.e., `my_int & 0x7ffffff` – jfs Oct 09 '15 at 21:24
