My RAM is around 15GB.

I have 30GB+ of data which I read in chunks:

import pandas as pd

df_user_logs = pd.read_csv('../input/user_logs.csv', chunksize=1000000)

and then applied memory reduction to each chunk like this:

list_of_dfs = []
for chunk in df_user_logs:
    change_datatype(chunk)        # downcast integer columns in place
    change_datatype_float(chunk)  # downcast float columns in place
    list_of_dfs.append(chunk)     # keep every reduced chunk in memory
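
change_datatype and change_datatype_float are not shown here; roughly, they downcast each numeric column to the smallest type that can hold its min/max values. A simplified sketch of that idea (not my exact functions):

import pandas as pd

def change_datatype(df):
    # Downcast int64 columns in place to the smallest integer type
    # that fits the column's value range (simplified sketch).
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

def change_datatype_float(df):
    # Downcast float64 columns in place to float32 where possible (simplified sketch).
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')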

I did this following the answers and comments given in Link 1 and Link 2.

Somehow a MemoryError occurred when I tried to concat the list_of_dfs:

df_user_logs = pd.concat(list_of_dfs)

Any solution would be greatly appreciated.

Chia Yi
  • How much memory do you really have? Do you really need to read all this in one shot? Also what does `change_datatype` do here? – EdChum Oct 25 '17 at 10:35
  • Why do `pd.concat(list_of_dfs)`? The result is equivalent to `pd.read_csv(..)` without the chunksize param. Do you expect a very different result from 2 operations that yield a similar result? – Adonis Oct 25 '17 at 10:42
  • @EdChum change_datatype gets the min and max value of each column and changes the datatype (e.g. from int64 to a smaller integer type) to reduce the memory size of the csv – Chia Yi Oct 25 '17 at 10:51
  • @Adonis change_datatype and change_datatype_float will greatly reduce my csv file size by half. – Chia Yi Oct 25 '17 at 10:56
  • That still doesn't answer the question of how much physical memory you have and whether you really need to load the entire df into memory. Basically, if you need to operate on data that can't fit into physical memory then you need to use a different approach such as hdfs or dask – EdChum Oct 25 '17 at 10:58
  • @EdChum just edited my post. My memory is 15GB – Chia Yi Oct 25 '17 at 10:58
  • Well, it's not going to work. Even if you could load this, any operations on it will result in lots of paging to disk due to temporary data allocations – EdChum Oct 25 '17 at 11:00
  • Think about this: you're making a list of dfs which, even if they were reduced to half size, you then concat into a final df. You don't have enough memory for this to happen; it will raise a `MemoryError` or be paging to disk like it's going out of fashion. You need to revisit your approach – EdChum Oct 25 '17 at 11:05
  • Maybe you need to try https://dask.pydata.org, it lets you operate on dataframes larger than your RAM in a Pandas-like way (see the sketch below). – CrazyElf Oct 25 '17 at 14:45
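
For reference, here is a minimal sketch of the dask approach suggested in the last comment. The aggregation and the 'msno' user-id column are placeholders chosen for illustration, not taken from the actual dataset:

import dask.dataframe as dd

# Dask reads the CSV lazily in partitions, so the full 30GB never has to sit in RAM at once.
df_user_logs = dd.read_csv('../input/user_logs.csv')

# Operations build a lazy task graph; .compute() materialises only the (much smaller) result.
# 'msno' is a hypothetical user-id column used purely for illustration.
per_user_counts = df_user_logs.groupby('msno').size().compute()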
