
I have a Genetic Algorithm that operates on a large dataframe. Each individual evolves a filter and then performs a bunch of calculations on the filtered data. I think that a cache of results may improve my performance.

Filtering a dataframe (e.g. “Col A” > 12.3) returns a pandas Series of boolean values of length the same as my data (around 100K entries). The actual size of the result of the calculation is negligible.

I need ideas in how to turn that series of booleans into a bitfield string that I can stick into a dictionary. Any other key could work as well, but space is a consideration. There are some packages that I am going to try out, bitstruct and bitarray, but since I am not all that experienced in numpy and pandas I wanted to ask in case I am missing something obvious.

Thanks in advance.

gregd128
    What do you mean by *bitfield string*? Can you give an example of one? – Daweo Jun 22 '21 at 13:41
  • In C I would use a char array with each char storing 8 booleans. Those values, but as a string in python. – gregd128 Jun 22 '21 at 13:49
  • Hi and welcome to SO. It would be great if you could have a look at [how-to-ask](/help/how-to-ask) and then try to produce a [mcve](/help/mcve). – rpanai Jun 22 '21 at 15:12

1 Answer


I had some time to work out my own question. I do not claim to have the fastest solution, but it is acceptable for my purposes.

Let's say you have a large dataset loaded as a pandas DataFrame, and you make a selection from that dataset.

selection = (data["Col1"] >= 0.2) & (data["Col2"] >= 0.5)

selection is now a pandas.core.series.Series of boolean values, one per row, specifying whether that row is included (True) or excluded (False). My goal is to efficiently turn that series into a packed string that can be used as a dictionary key.

def genkey(bseries):
    # .tobytes() yields one byte (0 or 1) per boolean; pack each group of
    # eight into a single character. In Python 3, indexing a bytes object
    # gives ints directly, so no ord() is needed.
    line = bseries.values.tobytes()
    lst = [line[i:i + 8] for i in range(0, len(line), 8)]
    retstr = ''
    if not lst:  # empty series -> empty key
        return retstr
    # An equivalent one-line comprehension with enumerate() turned out to be
    # the vast majority of the run time, hence the unrolled shifts.
    for j in lst[:-1]:
        retstr += chr((j[0] << 7) + (j[1] << 6) + (j[2] << 5) + (j[3] << 4)
                      + (j[4] << 3) + (j[5] << 2) + (j[6] << 1) + j[7])
    j = lst[-1]  # the last chunk can be shorter than 8 bytes
    if len(j) < 8:
        j = j + b'\x00' * (8 - len(j))
    retstr += chr((j[0] << 7) + (j[1] << 6) + (j[2] << 5) + (j[3] << 4)
                  + (j[4] << 3) + (j[5] << 2) + (j[6] << 1) + j[7])
    return retstr

genkey(selection) will return a string that I can use as a dictionary key.
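As an aside, numpy can likely do the same packing in a single vectorized call with np.packbits, which (like my loop) puts the first boolean in the most significant bit and zero-pads the final byte. This sketch is not benchmarked against genkey above; it returns a bytes key instead of a str, which is equally hashable:

```python
import numpy as np
import pandas as pd

def genkey_packbits(bseries):
    # Pack 8 booleans per byte; bytes objects are hashable,
    # so the result works directly as a dictionary key.
    return np.packbits(bseries.values).tobytes()

data = pd.DataFrame({"Col1": [0.1, 0.3, 0.9], "Col2": [0.6, 0.4, 0.7]})
selection = (data["Col1"] >= 0.2) & (data["Col2"] >= 0.5)
key = genkey_packbits(selection)  # three booleans pack into a single byte
```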

At the time of this writing I do not know whether my caching mechanism is worth the effort, since I do not yet know whether cache hits will happen frequently enough to justify the cost. I should know in a few days. If you have a similar problem, good luck!
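For anyone who wants the shape of the caching code itself, here is a minimal sketch of the pattern. The expensive_calc function and the column names are placeholders, not my actual GA code, and the key uses the np.packbits packing for brevity:

```python
import numpy as np
import pandas as pd

cache = {}

def expensive_calc(data, selection):
    # Placeholder for the real per-individual computation.
    return data.loc[selection, "Col1"].mean()

def cached_calc(data, selection):
    key = np.packbits(selection.values).tobytes()  # compact hashable key
    if key not in cache:
        cache[key] = expensive_calc(data, selection)  # cache miss: compute once
    return cache[key]

data = pd.DataFrame({"Col1": [1.0, 2.0, 3.0], "Col2": [0.1, 0.6, 0.9]})
sel = data["Col2"] >= 0.5
first = cached_calc(data, sel)   # computes and stores the result
second = cached_calc(data, sel)  # served from the cache
```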

gregd128
  • I doubt anyone is losing sleep over this, but it looks like I am getting between double and triple the performance, simply by caching answers into a dictionary and having a bit of code to manage that. Yay! – gregd128 Jun 27 '21 at 17:04