
[python 3.5.2, pandas 0.24.1, numpy 1.16.1, scipy 1.2.0]

I have the following pandas DataFrames:

data_pd
    nrows: 1,032,749,584
    cols: ['mem_id':np.uint32, 'offset':np.uint16, 'ctype':string, 'code':string]

obsmap_pd
    nrows: 10,887,542
    cols: ['mem_id':np.uint32, 'obs_id':np.uint32]    
             (obs_id contains consecutive integers from 0 to nrows - 1)

varmap_pd
    nrows: 4,596
    cols: ['ctype':string, 'code': string, 'var_id':np.uint16]   
             (var_id contains consecutive integers from 0 to nrows - 1)

These are the steps I am running:

***
sparse_pd = data_pd.groupby(['mem_id','ctype','code'])['offset'].nunique().reset_index(name='value')
sparse_pd['value'] = sparse_pd['value'].astype(np.uint16)
sparse_pd = pd.merge(pd.merge(sparse_pd, obsmap_pd, on='mem_id', sort=False),
                     varmap_pd, on=['ctype','code'], sort=False)[['obs_id','var_id','value']]
***

The purpose of this is to create a scipy csc_matrix in the next step:

from scipy.sparse import csc_matrix

mat_csc = csc_matrix((sparse_pd['value'].values * 1.0, (sparse_pd['obs_id'].values, sparse_pd['var_id'].values)),
                     shape=(obsmap_pd.shape[0], varmap_pd.shape[0]))

The creation of the csc_matrix is very fast, but the three lines of pandas code (between the ***) take 25.7 minutes. Any ideas on how this can be sped up?

ironv
  • A good strategy is to use join instead of merge and to set_index on your `on` columns. You'll want to set the index, join, then set the next index and join again, rather than doing it all on one line. You can see an example here: https://stackoverflow.com/questions/40860457/improve-pandas-merge-performance (perhaps this is even a duplicate?) – Brandon Bertelsen Feb 07 '19 at 04:07
  • Possible duplicate of [Improve Pandas Merge performance](https://stackoverflow.com/questions/40860457/improve-pandas-merge-performance) – Brandon Bertelsen Feb 07 '19 at 04:10
  • Is it all the merge? How long do the groupby and astype steps take? You'd be surprised at how slow `.groupby.nunique` can be for a `DataFrame` with possibly tens or hundreds of millions of groups. – ALollz Feb 07 '19 at 16:50
  • Have you considered using [Dask](https://dask.org/)? – Serg Nov 03 '21 at 03:29 (see the sketch after these comments)
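
For illustration, a minimal sketch of what the Dask route might look like for the heaviest step, the groupby; the partition count is a made-up tuning value, and this assumes data_pd already fits in memory:

import dask.dataframe as dd

# split the frame into partitions and compute nunique-per-group in parallel
ddf = dd.from_pandas(data_pd, npartitions=32)  # npartitions: hypothetical tuning knob
sparse_pd = (ddf.groupby(['mem_id','ctype','code'])['offset']
                .nunique()
                .compute()
                .reset_index(name='value'))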

1 Answer


A good way to speed up `merge` is to use `join` instead:

sparse_pd = sparse_pd\
    .set_index('mem_id')\
    .join(obsmap_pd.set_index('mem_id'), how='inner')

sparse_pd = sparse_pd\
    .reset_index()\
    .set_index(['ctype','code'])\
    .join(varmap_pd.set_index(['ctype','code']), how='inner')[['obs_id','var_id','value']]
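
Note that `join` defaults to a left join, so `how='inner'` is passed above to keep the semantics of the original `merge` calls. Sorting each index first (e.g. with `sort_index()`) can speed the joins up further, since pandas can take a faster path when both indexes are sorted and monotonic.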

Also, consider chaining these steps with pandas' [`DataFrame.pipe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html) method; you lose some readability, but it lets you express the whole transformation as a single chain without intermediate assignments.
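
For illustration, a minimal sketch of the piped version; the helper names join_obs and join_var are made up for this example:

def join_obs(df, obs):
    # first index-join, on mem_id
    return df.set_index('mem_id').join(obs.set_index('mem_id'), how='inner')

def join_var(df, var):
    # second index-join, on (ctype, code), keeping only the three needed columns
    return (df.reset_index()
              .set_index(['ctype','code'])
              .join(var.set_index(['ctype','code']), how='inner')
              [['obs_id','var_id','value']])

sparse_pd = (sparse_pd
             .pipe(join_obs, obsmap_pd)
             .pipe(join_var, varmap_pd))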

Adrian Mole