
I have a huge log dataset of around 10 million rows and I am running into serious performance and speed problems. I have tried pandas, numpy (also with the numba library) and dask, but I wasn't able to achieve sufficient speed.

Raw Data (minimal and simplified)

df = pd.read_csv('data.csv',sep=';', names=['ID', 'UserID'], error_bad_lines=False, 
     encoding='latin-1', dtype='category')

For problem reproduction:

df = pd.DataFrame({'ID': [999974708546523127, 999974708546523127, 999974708546520000], 'UserID': ['AU896', 'ZZ999', 'ZZ999']}, dtype='category')
df

ID                  UserID
999974708546523127  AU896  
999974708546523127  ZZ999
999974708546520000  ZZ999  

Expected Output

User   999974708546520000   999974708546523127
AU896           1                     0            
ZZ999           1                     1    

I am able to achieve this using the different scripts below. However, on big datasets the scripts are terribly slow. Ultimately I need to compute a correlation matrix between all users based on the expected output, which is the reason for its structure:

Pandas

results_id = pd.crosstab(df.UserID, df.ID, dropna=False)
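As far as I understand, crosstab goes through pivot_table internally, so a roughly equivalent groupby/unstack formulation might be somewhat cheaper. This is just a sketch of the same computation, not something I have benchmarked at full scale:

results_id = (df.groupby(['UserID', 'ID'], observed=True)  # observed=True skips unused category combinations
                .size()
                .unstack(fill_value=0))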

Numpy and Numba

import numpy as np
import numba

records = df.to_numpy()

unique_id =  np.unique(records[:, 0])
unique_userid = np.unique(records[:, 1])

results_id = np.zeros((len(unique_userid), len(unique_id)))

@numba.jit(nopython=True)
def ID_PreProcess(records, unique_id, unique_userid, results_id):
    # count, for every (user, ID) pair, how many rows match both values
    for userid in range(len(unique_userid)):
        for id in range(len(unique_id)):
            mask = (records[:, 1] == unique_userid[userid]) & (records[:, 0] == unique_id[id])
            results_id[userid, id] = mask.sum()
    return results_id

results_id = ID_PreProcess(records, unique_id, unique_userid, results_id)
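The nested loops above scan all rows once per (user, ID) pair, which is what kills the runtime. Below is a vectorized sketch of the same count matrix using np.unique with return_inverse and np.add.at, assuming records is a plain two-column array as built above (the intermediate variable names are mine and it is not benchmarked at full scale):

import numpy as np

ids = records[:, 0]
users = records[:, 1]

unique_id, id_idx = np.unique(ids, return_inverse=True)          # id_idx: column index per row
unique_userid, user_idx = np.unique(users, return_inverse=True)  # user_idx: row index per row

counts = np.zeros((len(unique_userid), len(unique_id)), dtype=np.int64)
np.add.at(counts, (user_idx, id_idx), 1)  # accumulate 1 for every (user, ID) occurrence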

Dask

import pandas as pd
import dask.dataframe as dd


dask_logs = dd.from_pandas(df, npartitions=2)
results_id = dd.concat([dask_logs.UserID ,dd.get_dummies(dask_logs.ID)],axis=1).groupby('UserID').sum().compute()
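get_dummies blows the ID column up into one column per unique ID, which gets extremely wide with this many distinct IDs. A long-format alternative I sketched (assuming the same dask_logs frame, again not benchmarked at full scale):

# count occurrences per (UserID, ID) pair without building a wide dummy matrix
pair_counts = dask_logs.groupby(['UserID', 'ID']).size().compute()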

I hope this shows that I have tried several different approaches; however, none of them is efficient enough for this number of rows.
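For context, the correlation step I eventually need would look roughly like this (a minimal sketch, assuming the user-by-ID count matrix from the expected output is available as a pandas DataFrame called results_id with users as rows):

# DataFrame.corr() correlates columns, so transpose to correlate the users (rows)
user_correlation = results_id.T.corr()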

I found this post, which seems to be very close to my problem, but I wasn't able to adapt its solutions to my case.

Thank you very much for your help!

Mike_H
  • How many Unique `ID`s are there? Do you really need them as individual columns... could it make sense to have them under one column called "ID" and repeated against (or part of an index) for the `UserID`... else it's most likely you're going to end up with a very sparse crosstab. – Jon Clements May 03 '19 at 15:37
  • How does `df.groupby(['UserID', 'ID']).size()` work for you (or not?) – Jon Clements May 03 '19 at 15:38
  • Try the above mike... that'll create a multi-level index with size of each grouping... that at least avoids the sparseness and probably more practical to index/retrieve data from as well – Jon Clements May 03 '19 at 15:39
  • Yeah, finally I need to compute a correlation matrix between all users. Can I do it from this result? – Mike_H May 03 '19 at 15:41
  • Your output seems to be perfect. How could I map it into the structure that I have described? I think this can possibly solve my issue! @JonClements – Mike_H May 03 '19 at 15:44
  • I tried to apply ```df.groupby(['UserID', 'ID']).size().reset_index(inplace=True)``` But this didn't work. :/ – Mike_H May 03 '19 at 15:56
  • Do you need it in that structure though? Basically, you've got that structure which you can use indexing on to effectively access columns without it being sparse... could you elaborate on "But this didn't work"? An example of how you want to access/use the data would be useful as well? – Jon Clements May 03 '19 at 16:16
  • Okay, it did work, applying a little correction. Thank you very very much! I will come back on Monday with a more elaborated answer. :)) – Mike_H May 03 '19 at 16:21
  • Looks like what you might have been trying to do is: `df.groupby(['UserID', 'ID'], as_index=False).size()` (although keeping it as an index could well make more sense)? Anyway - what did you come up with in the end? – Jon Clements May 03 '19 at 16:22
  • Oh that's fantastic... I look forward to reading your answer on Monday then - have a great weekend. – Jon Clements May 03 '19 at 16:24
  • `crosstab` will be slower than the equivalent pivot_table (or groupby) (because crosstab [uses pivot_table in the background](https://github.com/pandas-dev/pandas/blob/v0.24.2/pandas/core/reshape/pivot.py#L519-L521)). Though, I don't know if there is a substantial gain when the `DataFrame` is large. – ALollz May 03 '19 at 17:16
  • Hey @JonClements. worked perfectly. you might post it as the answer for this issue and add ```pd.pivot```, which finally gave me my expected output. – Mike_H May 06 '19 at 08:14
  • @Mike_H I'm still not sure exactly what you did re `pd.pivot` so not quite sure what the actual answer is that worked for you... feel free to self-answer :) – Jon Clements May 06 '19 at 11:07
  • Next time reduce the problem to a minimal representation. – JoseOrtiz3 Sep 08 '19 at 03:27

1 Answer


By incorporating helpful comments I came up with a solution on my own.

counts = df.groupby(['UserID', 'ID']).size()
counts = counts.reset_index()
counts.columns = ['UserID', 'ID', 'Counter']
Expected_Output = counts.pivot(index='UserID', columns='ID', values='Counter')
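Depending on the pandas version and dtypes, (UserID, ID) combinations that never occur may come out as NaN after the pivot; filling them with 0 gives the dense matrix from the expected output:

Expected_Output = Expected_Output.fillna(0).astype(int)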
Mike_H
  • Thanks for this Mike. I was haunted by the same question. `Dask` is quite flimsy to handle the groupby calculation. – FlyUFalcon Dec 20 '19 at 14:16