My RAM is around 15GB.

I have 30GB+ of data which I read in chunks:

import pandas as pd

df_user_logs = pd.read_csv('../input/user_logs.csv', chunksize=1000000)

and then applied memory reduction to each chunk like this:

list_of_dfs = []
for chunk in df_user_logs:
    change_datatype(chunk)        # downcast integer columns in place
    change_datatype_float(chunk)  # downcast float columns in place
    list_of_dfs.append(chunk)     # keep every reduced chunk in memory
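
change_datatype and change_datatype_float are not shown here; roughly, they downcast each numeric column to the smallest type that can hold its min/max values. A simplified sketch of that idea (not my exact functions):

import pandas as pd

def change_datatype(df):
    # Downcast int64 columns in place to the smallest integer type
    # that fits the column's value range (simplified sketch).
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

def change_datatype_float(df):
    # Downcast float64 columns in place to float32 where possible (simplified sketch).
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')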

I did this following the answers and comments given in Link 1 and Link 2.

Somehow a MemoryError occurred when I tried to concat the list_of_dfs:

df_user_logs = pd.concat(list_of_dfs)

Any solution would be greatly appreciated.

Chia Yi
  • How much memory do you really have? Do you really need to read all this in one shot? Also what does `change_datatype` do here? – EdChum Oct 25 '17 at 10:35
  • Why do `pd.concat(list_of_dfs)`? The result is equivalent to `pd.read_csv(..)` without the chunksize param. Do you expect a very different result from 2 operations that yield a similar result? – Adonis Oct 25 '17 at 10:42
  • @EdChum change_datatype gets the min and max value of each column and changes the datatype (e.g. from int64 to a smaller integer type) to reduce the memory size of the csv – Chia Yi Oct 25 '17 at 10:51
  • @Adonis change_datatype and change_datatype_float will greatly reduce my csv file size by half. – Chia Yi Oct 25 '17 at 10:56
  • That still doesn't answer the question of how much physical memory you have and whether you really need to load the entire df into memory. Basically, if you need to operate on data that can't fit into physical memory then you need to use a different approach such as hdfs or dask – EdChum Oct 25 '17 at 10:58
  • @EdChum just edited my post. My memory is 15GB – Chia Yi Oct 25 '17 at 10:58
  • Well, it's not going to work. Even if you could load this, any operations on it will result in lots of paging to disk due to temporary data allocations – EdChum Oct 25 '17 at 11:00
  • Think about this: you're making a list of dfs which, even if they were reduced to half size, you then concat into a final df. You don't have enough memory for this to happen; it will raise a `MemoryError` or be paging to disk like it's going out of fashion. You need to revisit your approach – EdChum Oct 25 '17 at 11:05
  • Maybe you need to try https://dask.pydata.org, it lets you operate on dataframes larger than your RAM in a Pandas-like way (see the sketch below). – CrazyElf Oct 25 '17 at 14:45
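
For reference, here is a minimal sketch of the dask approach suggested in the last comment. The aggregation and the 'msno' user-id column are placeholders chosen for illustration, not taken from the actual dataset:

import dask.dataframe as dd

# Dask reads the CSV lazily in partitions, so the full 30GB never has to sit in RAM at once.
df_user_logs = dd.read_csv('../input/user_logs.csv')

# Operations build a lazy task graph; .compute() materialises only the (much smaller) result.
# 'msno' is a hypothetical user-id column used purely for illustration.
per_user_counts = df_user_logs.groupby('msno').size().compute()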
