I have a pandas DataFrame of size (607875, 12294). The data is sparse and looks like:
     ID  BB  CC  DD ...
0   abc   0   0   1 ...
1   bcd   0   0   0 ...
2   abc   0   0   1 ...
...
I converted it to a sparse format by calling
dataframe = dataframe.to_sparse()
Later, I grouped it by ID and summed the row values with
dataframe = dataframe.groupby("ID").sum()
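For reference, here is a minimal reproducible sketch of this pipeline on a toy frame. The toy data and column values below are made up just to mirror the structure, and to_sparse only exists in pandas versions before 1.0:

import pandas as pd

# Toy frame mirroring the structure: a string ID column plus 0/1 value columns.
df = pd.DataFrame({
    "ID": ["abc", "bcd", "abc"],
    "BB": [0, 0, 0],
    "CC": [0, 0, 0],
    "DD": [1, 0, 1],
})

# The same two steps I run on the full (607875, 12294) frame.
df = df.to_sparse()             # deprecated; removed in pandas 1.0
df = df.groupby("ID").sum()
print(df)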
For smaller DataFrames this works perfectly well, but at this size it ran for an hour without finishing.
Is there a way to speed this up or work around it? Are there other sparse methods I can use, since the to_sparse method is deprecated?
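In case the newer sparse dtype API is the intended replacement, this is how I understand the conversion would look. It is only a sketch; casting everything except ID to SparseDtype("int64", 0) is my assumption about the right way to handle the string column:

import pandas as pd

# Same toy frame as above.
df = pd.DataFrame({
    "ID": ["abc", "bcd", "abc"],
    "BB": [0, 0, 0],
    "CC": [0, 0, 0],
    "DD": [1, 0, 1],
})

# Cast only the numeric 0/1 columns to a sparse extension dtype with fill value 0;
# astype(pd.SparseDtype(...)) is what recent pandas offers instead of to_sparse().
value_cols = df.columns.drop("ID")
df = pd.concat(
    [df[["ID"]], df[value_cols].astype(pd.SparseDtype("int64", 0))],
    axis=1,
)

# Same aggregation as before; I don't know whether this takes a faster path.
result = df.groupby("ID").sum()
print(result)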
The output DataFrame would have size (2000, 12294) and look like this (assuming the rows with ID abc contain no other 1s):
     ID  BB  CC  DD ...
0   abc   0   0   2 ...
1   bcd   0   0   0 ...
...
I have 32 GB of RAM on my PC, so memory should be enough.