Edit: this question has been rewritten.
I'm currently trying to do some bitwise overlap calculations on pandas DataFrames. The function I use works, but it is rather slow, and I would like to speed it up. Unfortunately, I don't have any good ideas for how to do that.
This is my current function:
from itertools import product

import pandas as pd


def get_simple_overlap(dataframe, events_x, events_y):
    df_dict = dict()
    for evt_x, evt_y in product(events_x, events_y):
        # Row-wise AND / OR of the two 0/1 columns
        overlap = (dataframe[evt_x] & dataframe[evt_y]).tolist()
        total = (dataframe[evt_x] | dataframe[evt_y]).tolist()
        try:
            percentage = sum(overlap) / sum(total)
        except ZeroDivisionError:
            # Neither event ever occurs, so define the overlap as 0
            percentage = 0
        if df_dict.get(str(evt_x)) is None:
            df_dict[str(evt_x)] = dict()
        df_dict[str(evt_x)][str(evt_y)] = percentage
    # Outer keys (evt_x) become columns, inner keys (evt_y) become the index
    df = pd.DataFrame(df_dict)
    return df

matrix = pd.DataFrame({
    "evt_x": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    ...
    "evt_y": [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    ...
})
event_x = ['evt_x']
event_y = ['evt_y']
overlaps = get_simple_overlap(matrix, event_x, event_y)
This is a simple way of doing it, but it is rather slow. It returns a matrix whose columns are all the events in event_x and whose index holds all the events in event_y, so there is a percentage for each evt_x/evt_y pair.
Here I expect overlaps['evt_x']['evt_y'] to be 0.75: there are 8 rows where at least one of the two events has a 1, and 6 rows where both of them have a 1, which gives 6/8.
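The arithmetic behind that expected value can be checked on the two columns shown in the example. This is only a small sketch using NumPy on those ten values; the variable names `both`, `either`, and `ratio` are chosen here for illustration.

```python
import numpy as np

# The two example columns exactly as listed in the question.
evt_x = np.array([0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
evt_y = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1])

both = int((evt_x & evt_y).sum())    # rows where both events are 1
either = int((evt_x | evt_y).sum())  # rows where at least one event is 1
ratio = both / either

print(both, either, ratio)  # 6 8 0.75
```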
Since I have hundreds of thousands of rows and several hundred columns, I would rather not iterate over the DataFrame like this, and would instead like to use a smarter approach.
I hope the rewritten version explains the problem in a simpler and clearer way.
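For reference, one possible vectorized formulation is sketched below. It assumes every selected column contains only 0 and 1, in which case the count of rows where both events are 1 is a matrix product, and the count where either is 1 follows from the column sums. The name `get_overlap_matmul` is hypothetical, and the sketch has not been benchmarked on data of the size described.

```python
import numpy as np
import pandas as pd


def get_overlap_matmul(dataframe, events_x, events_y):
    # Assumption: all selected columns hold only 0/1 values.
    x = dataframe[events_x].to_numpy(dtype=np.int64)
    y = dataframe[events_y].to_numpy(dtype=np.int64)

    # both[i, j] = number of rows where events_x[i] and events_y[j] are both 1
    both = x.T @ y
    # |A or B| = |A| + |B| - |A and B|
    either = x.sum(axis=0)[:, None] + y.sum(axis=0)[None, :] - both

    # Divide only where the union is non-zero; other entries stay 0.
    ratio = np.zeros(both.shape, dtype=float)
    np.divide(both, either, out=ratio, where=either != 0)

    # Same layout as the loop version: events_x as columns, events_y as index.
    return pd.DataFrame(
        ratio.T,
        index=[str(e) for e in events_y],
        columns=[str(e) for e in events_x],
    )
```

Called with the same arguments as get_simple_overlap, it should produce the same layout, so overlaps['evt_x']['evt_y'] is still the lookup for a single pair.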