I am attempting to read a very large Parquet file (10 GB), and I have no control over how it is generated (e.g. I can't make the file parts smaller).
What is the best way to read/write this data? I'm thinking of either streaming it from the file or buffering it (see the sketch at the end of this question).
My current code looks like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

if __name__ == "__main__":
    sc = SparkContext(appName="Parquet2CSV")
    sqlContext = SQLContext(sc)

    # Read the whole Parquet file and write it back out as CSV
    readdf = sqlContext.read.parquet('infile.parquet')
    readdf.write.csv('outfile.csv')
This works fine for small files, but for the large one it fails (essentially it blows my heap).
I was able to get a successful return code without the write, but with the write it fails.
What would be the best way to do this for large files?
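To make the "streaming" idea concrete, here is roughly what I had in mind, as a sketch only: it uses toLocalIterator() to pull rows back to the driver one partition at a time and appends them to a local CSV with Python's csv module. I haven't verified that this actually avoids the heap problem, so corrections are welcome.

from pyspark import SparkContext
from pyspark.sql import SQLContext
import csv

if __name__ == "__main__":
    sc = SparkContext(appName="Parquet2CSV")
    sqlContext = SQLContext(sc)

    readdf = sqlContext.read.parquet('infile.parquet')

    # Instead of materializing the whole DataFrame, iterate over it
    # one partition at a time on the driver and write rows as CSV.
    with open('outfile.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(readdf.columns)   # header row
        for row in readdf.toLocalIterator():
            writer.writerow(row)

Is something along these lines a reasonable direction for a file this size?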