I'm not sure what I'm missing here; I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format, and I would like to combine them all into a single dataframe, but I keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in how I create the dask dataframe, because the notebook crashes after my RAM fills up completely (as far as I can tell). Any pointers?
Below is the basic process I used:
import pandas as pd
import dask.dataframe as dd

# Start from the first pickle, split into 8 partitions
ddf = dd.from_pandas(pd.read_pickle('first.pickle'), npartitions=8)

# Append each remaining pickled dataframe
# (all_pickle_files is a list of the remaining .pickle paths)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))

ddf.to_parquet('alldata.parquet', engine='pyarrow')
- I've tried a variety of values for npartitions, but no number has allowed the code to finish running.
- All in all, there is about 30 GB of pickled dataframes I'd like to combine.
- Perhaps this is not the right library, but the docs suggest dask should be able to handle this.
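In case it clarifies what I'm after, here is the kind of lazy-loading alternative I've been wondering about instead of appending in a loop. This is only a rough sketch on my part (I'm assuming dask.delayed plus dd.from_delayed is an acceptable way to build the dataframe without reading everything into RAM up front), not something I know to be correct:

import dask
import dask.dataframe as dd
import pandas as pd

# One lazy task per pickle file, so nothing is loaded into memory up front;
# all_pickle_files is assumed to be the same list of paths as above.
lazy_frames = [dask.delayed(pd.read_pickle)(path) for path in all_pickle_files]

# Each delayed pandas dataframe becomes one partition of the dask dataframe.
ddf = dd.from_delayed(lazy_frames)

# Hopefully the parquet write then proceeds partition by partition.
ddf.to_parquet('alldata.parquet', engine='pyarrow')

If that's the right direction (or if it would hit the same memory wall), I'd appreciate hearing why.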