Pandas dataframes too large to append to dask dataframe?

Question

I'm not sure what I'm missing here, I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all in the same dataframe but keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in creating the dask dataframe as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?

Below is the basic process I used:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'),npartitions = 8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')

I've tried a variety of npartitions but no number has allowed the code to finish running.
all in all there is about 30GB of pickled dataframes I'd like to combine
perhaps this is not the right library but the docs suggest dask should be able to handle this

A couple of questions: what is the output of `df.npartitions` after the appends? does a single pickle suits in memory? — rpanai, Aug 04 '20 at 19:48
the appends eventually fail (before completion) and the notebook restarts. — jb4earth, Aug 05 '20 at 00:12
I think the number of partitions may just have needed to be a lot higher than I had set it. — jb4earth, Aug 05 '20 at 00:20

score 1 · Answer 1 · answered Aug 05 '20 at 01:55

Have you considered to first convert the pickle files to parquet and then load to dask? I assume that all your data is in a folder called raw and you want to move to processed

import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # eventually change dtypes
    
    df.to_parquet(fn_out, index=False)

fldr_in = 'data'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]

If you know than no more than one file fits in memory you should use a loop

for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)

If you know that more files fit in memory you can use delayed

from dask import delayed, compute

# this is lazy
out = [delayed(fun)(fn) for fn in fns]
# now you are actually converting
out = compute(out)

Now you can use dask to do your analysis.

Pandas dataframes too large to append to dask dataframe?

1 Answers1

Linked