[python 3.5.2, pandas 0.24.1, numpy 1.16.1, scipy 1.2.0]
I have the following pandas DataFrames:
data_pd
nrows: 1,032,749,584
cols: ['mem_id': np.uint32, 'offset': np.uint16, 'ctype': string, 'code': string]

obsmap_pd
nrows: 10,887,542
cols: ['mem_id': np.uint32, 'obs_id': np.uint32]
(obs_id has consecutive integers between 0 and the number of rows of obsmap_pd)

varmap_pd
nrows: 4,596
cols: ['ctype': string, 'code': string, 'var_id': np.uint16]
(var_id has consecutive integers between 0 and the number of rows of varmap_pd)
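In case it helps to reproduce, this is roughly how toy versions of these frames could be built (hypothetical random data with made-up ids and codes, far smaller than the real sizes):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n_obs, n_var, n_rows = 1000, 50, 100000   # toy sizes only

varmap_pd = pd.DataFrame({
    'ctype': ['c%d' % (i % 5) for i in range(n_var)],
    'code': ['k%d' % i for i in range(n_var)],
    'var_id': np.arange(n_var, dtype=np.uint16)})

obsmap_pd = pd.DataFrame({
    'mem_id': np.arange(n_obs, dtype=np.uint32) + 10000,
    'obs_id': np.arange(n_obs, dtype=np.uint32)})

# draw (ctype, code) pairs and mem_ids that are guaranteed to match the two maps
picks = varmap_pd.sample(n_rows, replace=True, random_state=0).reset_index(drop=True)
data_pd = pd.DataFrame({
    'mem_id': obsmap_pd['mem_id'].sample(n_rows, replace=True, random_state=1).values,
    'offset': rng.randint(0, 500, n_rows).astype(np.uint16),
    'ctype': picks['ctype'].values,
    'code': picks['code'].values})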
These are the steps I am running
***
# count distinct offsets per (mem_id, ctype, code)
sparse_pd = data_pd.groupby(['mem_id', 'ctype', 'code'])['offset'].nunique().reset_index(name='value')
sparse_pd['value'] = sparse_pd['value'].astype(np.uint16)
# map mem_id -> obs_id and (ctype, code) -> var_id, keeping only the triplet needed for the sparse matrix
sparse_pd = pd.merge(pd.merge(sparse_pd, obsmap_pd, on='mem_id', sort=False),
                     varmap_pd, on=['ctype', 'code'], sort=False)[['obs_id', 'var_id', 'value']]
***
The purpose of this is to create a scipy csc_matrix in the next step:

from scipy.sparse import csc_matrix

mat_csc = csc_matrix((sparse_pd['value'].values * 1.,
                      (sparse_pd['obs_id'].values, sparse_pd['var_id'].values)),
                     shape=(obsmap_pd.shape[0], varmap_pd.shape[0]))
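As a quick sanity check on toy data like the sketch above (hypothetical values):

print(mat_csc.shape)   # (obsmap_pd.shape[0], varmap_pd.shape[0])
print(mat_csc.nnz)     # number of stored (obs_id, var_id) entries
print(mat_csc.dtype)   # float64, because of the *1. multiplication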
The creation of the csc_matrix is very fast, but the three pandas statements (between the ***) take 25.7 minutes. Any ideas on how this can be sped up?