[python 3.5.2, pandas 0.24.1, numpy 1.16.1, scipy 1.2.0]
I have the following pandas DataFrames:
data_pd
nrows: 1,032,749,584
cols: ['mem_id': np.uint32, 'offset': np.uint16, 'ctype': string, 'code': string]

obsmap_pd
nrows: 10,887,542
cols: ['mem_id': np.uint32, 'obs_id': np.uint32]
(obs_id has consecutive integers between 0 and the number of rows of obsmap_pd)

varmap_pd
nrows: 4,596
cols: ['ctype': string, 'code': string, 'var_id': np.uint16]
(var_id has consecutive integers between 0 and the number of rows of varmap_pd)
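In case it helps to reproduce, this is roughly how toy versions of these frames could be built (hypothetical random data with made-up ids and codes, far smaller than the real sizes):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n_obs, n_var, n_rows = 1000, 50, 100000   # toy sizes only

varmap_pd = pd.DataFrame({
    'ctype': ['c%d' % (i % 5) for i in range(n_var)],
    'code': ['k%d' % i for i in range(n_var)],
    'var_id': np.arange(n_var, dtype=np.uint16)})

obsmap_pd = pd.DataFrame({
    'mem_id': np.arange(n_obs, dtype=np.uint32) + 10000,
    'obs_id': np.arange(n_obs, dtype=np.uint32)})

# draw (ctype, code) pairs and mem_ids that are guaranteed to match the two maps
picks = varmap_pd.sample(n_rows, replace=True, random_state=0).reset_index(drop=True)
data_pd = pd.DataFrame({
    'mem_id': obsmap_pd['mem_id'].sample(n_rows, replace=True, random_state=1).values,
    'offset': rng.randint(0, 500, n_rows).astype(np.uint16),
    'ctype': picks['ctype'].values,
    'code': picks['code'].values})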
These are the steps I am running
***
# count distinct offsets per (mem_id, ctype, code)
sparse_pd = data_pd.groupby(['mem_id', 'ctype', 'code'])['offset'].nunique().reset_index(name='value')
sparse_pd['value'] = sparse_pd['value'].astype(np.uint16)
# map mem_id -> obs_id and (ctype, code) -> var_id, keeping only the triplet needed for the sparse matrix
sparse_pd = pd.merge(pd.merge(sparse_pd, obsmap_pd, on='mem_id', sort=False),
                     varmap_pd, on=['ctype', 'code'], sort=False)[['obs_id', 'var_id', 'value']]
***
The purpose of this is to create a scipy csc_matrix in the next step:

from scipy.sparse import csc_matrix

mat_csc = csc_matrix((sparse_pd['value'].values * 1.,
                      (sparse_pd['obs_id'].values, sparse_pd['var_id'].values)),
                     shape=(obsmap_pd.shape[0], varmap_pd.shape[0]))
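As a quick sanity check on toy data like the sketch above (hypothetical values):

print(mat_csc.shape)   # (obsmap_pd.shape[0], varmap_pd.shape[0])
print(mat_csc.nnz)     # number of stored (obs_id, var_id) entries
print(mat_csc.dtype)   # float64, because of the *1. multiplication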
The creation of the csc_matrix is very fast, but the three pandas statements (between the ***) take 25.7 minutes. Any ideas on how this can be sped up?