I had some time to figure out my question. I do not claim to have the fastest solution, but it performs acceptably for my purposes.

Let's say you have a large dataset loaded as a pandas DataFrame, and you make a selection from it:
selection = (data["Col1"] >= 0.2) & (data["Col2"] >= 0.5)
selection is now a boolean pandas.core.series.Series with one value per row: True if the row is included in the selection, False if it is not. My goal is to efficiently turn that Series into a packed string that can be used as a dictionary key.
def genkey(bseries):
    # One byte (0 or 1) per boolean value.
    line = bseries.values.tobytes()
    # Split into chunks of up to 8 bytes; each chunk becomes one character.
    lst = [line[i:i + 8] for i in range(0, len(line), 8)]
    if not lst:  # empty selection
        return ''
    # A one-liner generator expression doing the same packing turned out to
    # take the vast majority of the time, so the shifts are written out.
    # In Python 3 indexing a bytes object yields ints; Python 2 would need ord().
    retstr = ''
    for j in lst[:-1]:
        retstr += chr((j[0] << 7) + (j[1] << 6) + (j[2] << 5) + (j[3] << 4)
                      + (j[4] << 3) + (j[5] << 2) + (j[6] << 1) + j[7])
    j = lst[-1]  # the last chunk can hold fewer than 8 bytes
    if len(j) < 8:
        j = j + b'\x00' * (8 - len(j))
    retstr += chr((j[0] << 7) + (j[1] << 6) + (j[2] << 5) + (j[3] << 4)
                  + (j[4] << 3) + (j[5] << 2) + (j[6] << 1) + j[7])
    return retstr
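As an aside, numpy's packbits performs essentially the same packing natively, and may be faster; I have not benchmarked it against the loop above. A sketch of an equivalent key function (it returns bytes rather than a str, which works just as well as a dictionary key):

import numpy as np

def genkey_np(bseries):
    # np.packbits packs 8 boolean values per output byte, most significant
    # bit first, zero-padding the final group: the same layout as the loop.
    return np.packbits(bseries.values).tobytes()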
genkey(selection) now returns a string that I can use as a dictionary key.
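For illustration, a minimal sketch of the kind of caching I have in mind; the cache dict and the expensive computation here are hypothetical placeholders:

cache = {}

def cached_result(data, selection):
    key = genkey(selection)
    if key not in cache:
        # Hypothetical expensive computation over the selected rows.
        cache[key] = data[selection].sum()
    return cache[key]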
At the time of this writing I do not know whether my caching mechanism is worth the effort; I do not know whether cache hits will happen frequently enough to justify the cost. I suppose I will know in a few days whether it is worth it. If you have a similar problem, good luck!