Does anyone know a better way of doing column calculations?

Question

Has been rewritten!

Currently I'm trying to make some bitwise overlap calculations, using pandas dataframes. The function I use does work, but it's rather slow, and I would like to speed it up. Unfortunately I don't really have any good ideas of how I can do that.

This is my current function to do so

def get_simple_overlap(dataframe, events_x, events_y):
    df_dict = dict()

    for evt_x, evt_y in product(events_x, events_y):
        overlap = (dataframe[evt_x] & dataframe[evt_y]).tolist()
        total = (dataframe[evt_x] | dataframe[evt_y]).tolist()
        try:
            percentage = sum(overlap) / sum(total)
        except ZeroDivisionError:
            percentage = 0

        if df_dict.get(str(evt_x)) is None:
            df_dict[str(evt_x)] = dict()
        
        df_dict[str(evt_x)][str(evt_y)] = percentage
    
    df = pd.DataFrame(df_dict)

    return df

matrix = pd.DataFrame({
    "evt_x": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
     ...
    "evt_y": [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
     ...
})

event_x = ['evt_x']
event_y = ['evt_y']

overlaps = get_simple_overlap(matrix, event_x, event_y)

This was a simple way of doing it, and it rather slow. It returns a matrix with the columns being all events in event_x and indexes being all events in event_y. So there is a percentage for each evt_x - evt_y pair.

Here I expect the overlap of overlaps['evt_x']['evt_y'] to be 0.75 since there are 8 times where either event have a 1 at the same index, and 6 times where both of them have a 1 at the same index, making it be 6/8.

Since i have hundreds of thousands indexes with multiple hundreds columns, I would like not iterate through the dataframe like this. And instead use some smarter way of doing this.

Hope the rewritten version is explained in a way simpler and clearer way.

Surely you can simplify this? Why is there a timestamp involved? Why so many rows in the matrix? Please make the example as simple as possible. — Matt Hall, Sep 01 '22 at 11:48
Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. — itprorh66, Sep 01 '22 at 13:20
Do you need to create the entire overlaps matrix, or can you just generate a specific overlap on the fly? — tcotts, Sep 02 '22 at 20:18
@tcotts When the you call the function, the two event parameters specify which events you want to calculate for. I.E. You can choose only a fraction of all events or choose all of them — Martin Lange, Sep 05 '22 at 05:10

s_pike · Accepted Answer · 2022-09-02T16:11:56.843

This uses dot product to count number of times events occur concurrently. We negate the matrix and use the dot product again to count the number of times no events occur, and subtract that from the total possible number of events to get the number of times at least one event occurs. That gives us the required numerator and denominator.

import pandas as pd
import numpy as np

matrix = pd.DataFrame({
    "evt_w": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "evt_x": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "evt_y": [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    "evt_z": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})


mat=matrix.to_numpy()
neg_mat=1-mat
sim_events = np.dot(mat.T, mat)
tot_events = neg_mat.shape[0]-np.dot(neg_mat.T, neg_mat)
overlaps_mat = np.divide(sim_events, tot_events, out=np.zeros_like(sim_events, dtype=float), where=tot_events!=0)
overlaps_df = pd.DataFrame(overlaps_mat, index=matrix.columns, columns=matrix.columns)

This gives the output:

        evt_w       evt_x       evt_y   evt_z
evt_w   1.000000    0.666667    0.50    0.00
evt_x   0.666667    1.000000    0.75    0.00
evt_y   0.500000    0.750000    1.00    0.25
evt_z   0.000000    0.000000    0.25    1.00

As this question is about speed I did a %%timeit comparison on my machine:

# dummy data
matrix = pd.DataFrame(np.random.randint(0,2,size=(100,10)))

This method returned 99.4 µs ± 803 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Using the original get_simple_overlap function as follows:

list1, list2 = zip(*product(matrix.columns,matrix.columns))
overlaps = get_simple_overlap(matrix, list1, list2)
overlaps

gives 1.77 s ± 46.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for the same matrix.

This would calculate the overlap percentage for the whole matrix, rather than each event. and not each event pair. If I had, say, 10 events in each list, I expect a matrix that is 10x10 and represents a percentage for each event pair — Martin Lange, Sep 02 '22 at 14:26
I can see you edited your answer. I was very excited to try it out. Unfortunately it does not work in my matrix... In the matrix I used i have 117 events, with each 1588202 rows of 1s and 0s. A lot of the overlaps, this function produced, was unfortunately greater than 1. I can link to the dataframe i used to get this result if interested — Martin Lange, Sep 05 '22 at 05:58
Yes, could you link? I've just tried using `matrix = pd.DataFrame(np.random.randint(0,2,size=(1600000,120)))`, which took about 15mins on my laptop, returns a 120x120 matrix, with no cells greater than 1. — s_pike, Sep 05 '22 at 08:59
Now that I tried running it again, this time using my eyes on the initial matrix, and it worked! The issue was that I hadn't set my matrix to binary state, so some numbers where 2 and sometimes even 3, and that's why it produces numbers greater than 1. — Martin Lange, Sep 05 '22 at 09:23

Does anyone know a better way of doing column calculations?

1 Answers1