I have a large matrix (~200 million rows) describing a list of actions that occurred every day (there are ~10000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen during the same days.
Here is an example dataset:
data = {'date': ['01', '01', '01', '02','02','03'],
'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns = ['date','action'])
I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.
# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)
# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()
# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)
I've also tried:
- getting the dummies in non-sparse format;
- using groupby for aggregation;
- going to sparse format before matrix multiplication.
But I fail in step 1, since there is not enough RAM to create such a large matrix.
I would greatly appreciate your help.