
I am trying to preprocess data for further analysis. First I read the data from a CSV file.

Then I split it up into three parts. Lastly I need to transform one column using get_dummies, concat it with the usernames, and sum the result of a groupby.

import pandas as pd

RawData_v2_clear=pd.read_csv('C:\\Users\\User\\Documents\\top200users_filtered.csv', 
        sep=';', usecols = ['Username', 'Code', 'Object'], error_bad_lines=False, 
        encoding='latin-1')

dfU = RawData_v2_clear['Username']              
dfT = RawData_v2_clear['Code']   
dfO = RawData_v2_clear['Object'] 

del RawData_v2_clear, dfO  # to free up some memory

df_newT = pd.concat([dfU,pd.get_dummies(dfT)],axis=1)

df_new_gbyT = df_newT.groupby('Username').sum()

RawData_v2_clear has a shape of roughly 11 million rows x 3 columns.

Error:
  File "c:\Users\User\Desktop\Faulty_Skript.py", line XXX, in <module>
    df_newT = pd.concat([dfU,pd.get_dummies(dfT)],axis=1).sum()
  File "C:\Users\User\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 866, in get_dummies
    dtype=dtype)
  File "C:\Users\User\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 963, in _get_dummies_1d
    dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
MemoryError

On another system this operation takes some time but finishes without a MemoryError. Maybe someone has a good idea to fix this memory issue? Maybe append is more memory-friendly than concat? However, my implementation with append failed on my current system as well.

Thank you very much!

Mike_H
  • A more general solution would be to use Dask instead. I ran into this issue very often while dealing with large datasets. If you're using an IPython/Jupyter notebook, maybe try resetting the kernel to free memory. Also maybe try reading the CSV in chunks and then concatenating them later (see the sketch after these comments)? – Yash Nag Apr 18 '19 at 07:01
  • Maybe try casting 'Username' column to `category` dtype..? Could try passing `dtype={'Username': 'category'}` to read_csv method...? – Chris Adams Apr 18 '19 at 07:02
  • @YashNag Yeah, I tried Spyder and VS Code via Conda. However, both ran into the same issue. It's hard to concatenate them later, because this is the final data I will analyse. It's confusing that this runs just fine on another, less powerful system. Chris: I have 3000 different usernames, so using it as a category might not help that much, right? – Mike_H Apr 18 '19 at 07:05
  • category type is super memory efficient. If it's type 'object' just now, casting to category will free up a lot of space... encodes as 'int' under-the-hood. Worth a try anyway if you have over 11M rows with only 3000 unique usernames... https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-memory – Chris Adams Apr 18 '19 at 07:08
  • @ChrisA I tried your previous answer, but I'm still running into `MemoryError` – Mike_H Apr 18 '19 at 08:02
  • 1
    @Mike_H sorry to hear that buddy. It looks like your ultimately trying to create the cross-tab of 'Usernames' and 'codes'...? Try reading in, with `dtype='category'` and then go straight to `df_new_gbyT = pd.crosstab(RawData_v2_clear.Username, RawData_v2_clear.Code)` ..? – Chris Adams Apr 18 '19 at 08:05
  • MVP, please post it as the answer :D even though I'm not sure yet if the result is correct. But time will tell! – Mike_H Apr 18 '19 at 08:08
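For reference, here is a minimal sketch combining the two suggestions from the comments (chunked reading plus the category dtype and crosstab). The file path, separator and columns are taken from the question; the chunk size is only illustrative:

import pandas as pd

# Read the file in chunks so the full dense dummy matrix is never materialised.
reader = pd.read_csv('C:\\Users\\User\\Documents\\top200users_filtered.csv',
        sep=';', usecols=['Username', 'Code'], encoding='latin-1',
        dtype='category', chunksize=1_000_000)

partial = []
for chunk in reader:
    # Cross-tabulate each chunk immediately; only the small aggregated table is kept.
    partial.append(pd.crosstab(chunk['Username'], chunk['Code']))

# Align the per-chunk tables and add up counts for usernames that appear in several chunks.
df_new_gbyT = pd.concat(partial).fillna(0).groupby(level=0).sum()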

1 Answer


IIUC, try the dtype='category' parameter in the read_csv method and pandas.crosstab instead:

import pandas as pd

RawData_v2_clear=pd.read_csv('C:\\Users\\User\\Documents\\top200users_filtered.csv', 
        sep=';', usecols = ['Username', 'Code', 'Object'], error_bad_lines=False, 
        encoding='latin-1', dtype='category')

df_new_gbyT = pd.crosstab(RawData_v2_clear.Username, RawData_v2_clear.Code)
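For example, one way to see how much the category dtype saves is to compare memory_usage(deep=True) on a sample of the column (this is just an illustrative sanity check):

# Rough, illustrative check (on a 1M-row sample) of the saving from the category dtype.
sample = RawData_v2_clear['Username'].head(1_000_000)
print('as object:  ', sample.astype('object').memory_usage(deep=True))
print('as category:', sample.memory_usage(deep=True))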
Chris Adams