I have data in parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt this without using a Spark framework?
I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in.
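For context, here is a minimal sketch of the kind of chunked reading I'm hoping to achieve with pyarrow, processing one row group at a time and down-sampling each chunk (the file path and the sampling fraction are placeholders):

```python
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # placeholder path

samples = []
for i in range(pf.num_row_groups):
    # Read a single row group so only a slice of the file is in memory
    chunk = pf.read_row_group(i).to_pandas()
    # Down-sample each chunk, e.g. keep 10% of its rows
    samples.append(chunk.sample(frac=0.1))

# Combine the down-sampled chunks into one dataframe
df = pd.concat(samples, ignore_index=True)
```

I'm not sure whether this is the right approach, or whether it will still blow up on a file this size.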
Any tips or suggestions would be greatly appreciated!