
I have a large table (~200 million rows) listing which actions occurred on each day (there are ~10,000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen on the same days.

Here is an example dataset:

import pandas as pd

data = {'date':   ['01', '01', '01', '02', '02', '03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns=['date', 'action'])

I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.

# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)

# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()

# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)

I've also tried:

  1. getting the dummies in non-sparse format;
  2. using groupby for aggregation;
  3. going to sparse format before matrix multiplication.

But I fail at step 1, since there is not enough RAM to create such a large dense matrix.

I would greatly appreciate your help.

Dudelstein

2 Answers


I came up with an answer using only sparse matrices based on this post. The code is fast, taking about 10 seconds for 10 million rows (my previous code took 6 minutes for 5000 rows and was not scalable).

The time and memory savings come from working with sparse matrices until the very last step when it is necessary to unravel the (already small) co-occurrence matrix before export.

## Imports needed for this approach
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

## Get unique values for date and action
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)

## Add an auxiliary count variable
df['count'] = 1

## Define a sparse matrix with one row per date and one column per action
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                           shape=(date_c.categories.size, action_c.categories.size))

## Compute the co-occurrence matrix as a dot product of sparse matrices
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)

## Unravel the (already small) co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(),
                    index=action_c.categories, columns=action_c.categories)
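
As a quick spot check (an illustrative addition, assuming df is the sample DataFrame from the question), the expected counts follow directly from the example data:

# Actions 100 and 989855552 both occur on days '01' and '02', so they
# should co-occur twice; action 777 only occurs on day '03', so it never
# co-occurs with action 100.
print(cooc.loc[100, 989855552])   # expected: 2
print(cooc.loc[777, 100])         # expected: 0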
Dudelstein

There are a couple of fairly straightforward simplifications you can consider.

One of them is that you can call max() directly on the GroupBy object; you don't need the fancy indexing on all the columns, since that's what it returns by default:

df = df.groupby('date').max()

Second, you can disable sorting in the GroupBy. As the Pandas reference for groupby() says:

sort : bool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

So try that as well:

df = df.groupby('date', sort=False).max()

Third, you can also use a simple pivot_table() to produce the same result:

df = df.pivot_table(index='date', aggfunc='max')

Yet another approach is going back to your "actions" DataFrame, turning that into a MultiIndex and using it for a simple Series, then calling unstack() on it. That should get you the same result without the get_dummies() step (though I'm not sure whether this will drop some of the sparseness properties you're currently relying on):

actions_df = pd.DataFrame(data, columns = ['date', 'action'])
actions_index = pd.MultiIndex.from_frame(actions_df, names=['date', ''])
actions_series = pd.Series(1, index=actions_index)
df = actions_series.unstack(fill_value=0)

Your supplied sample DataFrame is quite useful for checking that these are all equivalent and produce the same result, but unfortunately not that great for benchmarking it... I suggest you take a larger dataset (but still smaller than your real data, like 10x smaller or perhaps 40-50x smaller) and then benchmark the operations to check how long they take.
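
For example, you could generate a synthetic dataset along these lines and benchmark against it (a rough sketch; the row count and the numbers of distinct dates and actions are placeholders to tune towards your real data, and names like bench_df and n_rows are just for illustration):

import numpy as np
import pandas as pd

# Placeholder sizes; tune these to resemble the real data.
rng = np.random.default_rng(0)
n_rows, n_dates, n_actions = 1_000_000, 3_000, 10_000
bench_df = pd.DataFrame({
    'date':   rng.integers(0, n_dates, size=n_rows),
    'action': rng.integers(0, n_actions, size=n_rows),
}).drop_duplicates()   # unstack() cannot handle duplicate (date, action) pairs

# Rebuild the same inputs used above, so the timings below apply directly.
df = pd.concat([bench_df['date'], pd.get_dummies(bench_df['action'], sparse=True)], axis=1)
actions_index = pd.MultiIndex.from_frame(bench_df, names=['date', ''])
actions_series = pd.Series(1, index=actions_index)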

If you're using Jupyter (or another IPython shell), you can use the %timeit command to benchmark an expression.

So you can enter:

%timeit df.groupby('date').max()
%timeit df.groupby('date', sort=False).max()
%timeit df.pivot_table(index='date', aggfunc='max')
%timeit actions_series.unstack(fill_value=0)

And compare results, then scale up and check whether the whole run will complete in an acceptable amount of time.

filbranden
  • Update: Getting rid of [cols] in groupby shortened computation time by about a third, while using pivot_table was almost identical time-wise. The MultiIndex approach is incredibly fast, taking less than 1s compared to 240s for the other approaches. However, it still does not create a sparse matrix, and I run into memory errors with unstack() already at 1 million rows. Would it be possible to unstack the series directly into a sparse matrix? Thanks a lot. – Dudelstein Jan 22 '20 at 18:18
  • @Dudelstein Yes! You can actually make the DataFrame sparse by converting the action series to sparse. Use this line instead: `actions_series = pd.Series(1, index=actions_index).astype(pd.SparseDtype(int))`, then the unstacked df will be sparse as well (you can check with `df.dtypes`). Can you confirm that fixes it for you? – filbranden Jan 22 '20 at 19:36
  • I'll update my answer to incorporate the observations from your comments... One small clarification, regarding pivot_table, was it the same as shortened groupby (one third of original) or your original one with cols? Did you check whether sorting had any effect? I think these are useful observations to capture in the answer... Thanks for testing my suggestions! – filbranden Jan 22 '20 at 19:39
  • Using `groupby().max()` as well as `pivot_table()` takes about 4 minutes, which is a third faster than my original approach. I have tried adding `astype(pd.SparseDtype(int))`, which works great for 100k rows, but when using 1 million rows I get the error `Unstacked DataFrame is too big, causing int32 overflow`, and when using 7.5 million I receive `index 1632668278 is out of bounds for axis 0 with size 1632663830`. I was also considering going a different route posted [here](https://stackoverflow.com/questions/31661604/efficiently-create-sparse-pivot-tables-in-pandas), but so far haven't tried it. – Dudelstein Jan 22 '20 at 20:11