I have a process that writes a dataframe out to a pickle file using the standard df.to_pickle method:
import pandas as pd
# sample data
data = {'col1': [1, 2, 3, 4, 5],
        'col2': ['a', 'b', 'c', 'd', 'e']}
# create dataframe
df = pd.DataFrame(data)
print(df)
   col1 col2
0     1    a
1     2    b
2     3    c
3     4    d
4     5    e
# write the dataframe out to a pickle file
df.to_pickle('myPickleFile.p')
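For completeness, reading it back in one go works fine; here is a quick round-trip check with the toy data above (file name is just the one from my example):

```python
import pandas as pd

# toy data, same as above
data = {'col1': [1, 2, 3, 4, 5],
        'col2': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)
df.to_pickle('myPickleFile.p')

# read_pickle restores the frame exactly, but it always loads the whole file
restored = pd.read_pickle('myPickleFile.p')
print(df.equals(restored))  # True
```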
Now I have a second (separate) process that needs to read and process that file in chunks (for memory reasons, since my data is extremely large). If this were, say, a txt file or an HDF file, I'd usually do something like this:
for chunk in pd.read_csv('myCSVFile.csv', chunksize=1000000):
    # do stuff, for example:
    print(len(chunk))
The key reason I'm keen to keep the file in pickle format is the read/write speed compared to txt or HDF files; in my case it's more than 300% faster.
It seems that I can't do that with read_pickle, as it doesn't support reading in chunks. So my question is: is there a way of reading a pickle file in chunks into pandas? If yes, please point me in the right direction.
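For illustration, the best I can come up with right now is to load the entire pickle and then slice it into chunks myself (a sketch with the toy data and a chunk size of 2, chosen only so the split is visible), which defeats the memory savings I'm after:

```python
import pandas as pd

# toy data, same as the example above
data = {'col1': [1, 2, 3, 4, 5],
        'col2': ['a', 'b', 'c', 'd', 'e']}
pd.DataFrame(data).to_pickle('myPickleFile.p')

# this loads the whole file into memory first -- exactly what I want to avoid
df = pd.read_pickle('myPickleFile.p')

chunk_size = 2  # tiny value just for the toy data
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    print(len(chunk))  # prints 2, 2, 1
```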
Thanks.