I am trying to preprocess data for further analysis. First I read the data from a CSV file. Then I split it into three parts. Lastly I need to transform one array using get_dummies, concat, and sum on the result of a groupby.
import pandas as pd
RawData_v2_clear=pd.read_csv('C:\\Users\\User\\Documents\\top200users_filtered.csv',
sep=';', usecols = ['Username', 'Code', 'Object'], error_bad_lines=False,
encoding='latin-1')
dfU = RawData_v2_clear['Username']
dfT = RawData_v2_clear['Code']
dfO = RawData_v2_clear['Object']
del RawData_v2_clear, dfO  # to free up some memory
df_newT = pd.concat([dfU,pd.get_dummies(dfT)],axis=1)
df_new_gbyT = df_newT.groupby('Username').sum()
RawData_v2_clear has shape (~11 million rows x 3 columns).
Error:
File "c:\Users\User\Desktop\Faulty_Skript.py", line XXX, in <module>
df_newT = pd.concat([dfU,pd.get_dummies(dfT)],axis=1).sum()
File "C:\Users\User\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 866, in get_dummies
dtype=dtype)
File "C:\Users\User\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 963, in _get_dummies_1d
dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
MemoryError
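If I read the traceback right, _get_dummies_1d builds a dense matrix with one column per unique value of 'Code', so memory grows with rows × unique codes. A rough back-of-the-envelope estimate (the number of unique codes below is only an assumed example, I have not counted them):

rows = 11_000_000        # approximate number of rows in RawData_v2_clear
unique_codes = 2_000     # assumed number of distinct 'Code' values (illustrative only)
bytes_per_cell = 1       # get_dummies returns uint8 dummies by default, as far as I know
print(rows * unique_codes * bytes_per_cell / 1024**3, "GiB")  # ~20.5 GiB for the dummy matrix alone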
On another system this operation takes some time but finishes without a MemoryError. Does anyone have a good idea how to fix this memory issue? Is append more memory friendly than concat? However, my append implementation failed as well on my current system.
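For reference, this is roughly the chunked variant I have been considering as an alternative, so that the full dummy matrix never has to exist in memory at once (the chunk size is arbitrary and the combining step is untested on the full file):

import pandas as pd

# Read the CSV in chunks; 'Object' is left out since it gets deleted anyway
chunks = pd.read_csv('C:\\Users\\User\\Documents\\top200users_filtered.csv',
                     sep=';', usecols=['Username', 'Code'],
                     error_bad_lines=False, encoding='latin-1',
                     chunksize=500_000)

partials = []
for chunk in chunks:
    # dummy-encode and aggregate each chunk separately
    dummies = pd.get_dummies(chunk['Code'])
    part = pd.concat([chunk[['Username']], dummies], axis=1).groupby('Username').sum()
    partials.append(part)

# combine the per-chunk sums; columns missing from a chunk become NaN and are treated as 0 by sum()
df_new_gbyT = pd.concat(partials).groupby(level=0).sum()

I also wonder whether pd.get_dummies(dfT, sparse=True) would help here, since it returns sparse columns instead of a dense uint8 matrix, but I have not verified that the later groupby/sum stays memory friendly with sparse data.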
Thank you very much!