I have a dictionary of dataframes that I want to combine. Ideally, I would do this in memory, like this:

import pandas as pd

values = ['A', 'B', 'C']
dats = [dataset[x] for x in values] # build a list of dataframes from the dictionary of dataframes "dataset" (causes the kernel crash)
dataset_df = pd.concat(dats, sort=False, join='outer', ignore_index=True) # concatenate the datasets

However, this causes a kernel crash, so I have to resort to pickling the dictionary first and retrieving the dataframes one by one, which is a real performance hog:

dats = [get_dataset(x) for x in values] # get_dataset() retrieves one dataframe from disk
dataset_df = pd.concat(dats, sort=False, join='outer', ignore_index=True) # concat datasets
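
get_dataset() is not reproduced in the question; a minimal sketch of what such a helper might look like, assuming each dataframe was pickled to its own file (the path layout here is hypothetical):

import pandas as pd

def get_dataset(name):
    # Hypothetical layout: each dataframe lives in its own pickle file "<name>.pkl"
    return pd.read_pickle(f"{name}.pkl")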

The combined dataset fits in memory alongside the individual datasets. I have confirmed this by adding the combined dataframe to the dictionary of dataframes afterwards. So why the kernel crash?

Does putting dataframes from a dict into a list somehow cause excessive memory usage?

DataWiz

1 Answer

You can pass a generator expression to concat, like this:

dats = (dataset[x] for x in values) # generator expression: no intermediate list is materialized
dataset_df = pd.concat(dats, sort=False, join='outer', ignore_index=True)
Oleg O
  • This was my initial version, but it also caused a kernel crash. – DataWiz Dec 04 '19 at 11:35
  • Another workaround is to downcast the dtypes, if possible (see the first sketch after this thread). See my answer here for details: https://stackoverflow.com/questions/59090572/does-pandas-automatically-skip-rows-do-a-size-limit/59090753 – Oleg O Dec 04 '19 at 11:40
  • I am thinking about reducing dataframe size as well, but that's a longer-term prospect given the cascade of subsequent changes it would entail. I still wonder why the list comprehension causes a crash if the final combined dataframe easily fits into memory alongside the original dataframes. – DataWiz Dec 04 '19 at 11:46
  • If the generator also fails as you stated, this is apparently not a problem with the list. Besides, AFAIK you would have gotten a MemoryError if the table were simply too big. The 'outer' option of concat may play some tricks and inflate the resulting table, though. You could start by concatenating just two tables, or reduce them on load, and catch the point where it starts to crash (see the second sketch after this thread). – Oleg O Dec 04 '19 at 11:48
  • That's the thing, it crashes even when concatenating only two small tables, so I'm actually not convinced it's a memory issue (in addition to the lack of a MemoryError). – DataWiz Dec 04 '19 at 11:53
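
For reference on the downcasting workaround mentioned in the comments: a minimal sketch (the helper name downcast_df is mine, not taken from the linked answer) that shrinks numeric columns to the smallest dtype that holds their values:

import pandas as pd

def downcast_df(df):
    # Shrink each numeric column to the smallest dtype that fits its values
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df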
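
And a sketch of the incremental concatenation suggested above, assuming the dictionary dataset and list values from the question, to pinpoint where the crash occurs:

import pandas as pd

dataset_df = dataset[values[0]]
for x in values[1:]:
    # Concatenate one dataframe at a time and report progress
    print(f"concatenating {x}")
    dataset_df = pd.concat([dataset_df, dataset[x]],
                           sort=False, join='outer', ignore_index=True)
    print(f"shape so far: {dataset_df.shape}")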