I am trying to convert a large Parquet file into CSV. Since my RAM is only 8 GB, I get a memory error. Is there any way to read the Parquet file into multiple dataframes over a loop?
- Surely if one partitions appropriately this can work? As stuff spills to disk... – thebluephantom Jun 18 '19 at 11:36
- Sorry, I couldn't understand your comment. Can you explain this? @thebluephantom – Rahul Jun 24 '19 at 10:32
- This can be done using the pyarrow library. You can read the data in batches, read certain row groups, or read certain columns only. https://stackoverflow.com/a/74163397/6563567 – ns15 Oct 22 '22 at 16:31
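Building on the pyarrow suggestion in the comment above, here is a minimal sketch of reading the file in batches and appending each batch to a CSV. ParquetFile.iter_batches streams one batch at a time; the batch size and the 'output.csv' name are illustrative assumptions, not from the original thread.
import pyarrow.parquet as pq

pf = pq.ParquetFile('ParquetFile.parquet')
# Stream batches so only one batch is held in memory at a time;
# write the CSV header only for the first batch.
with open('output.csv', 'w', newline='') as f:
    for i, batch in enumerate(pf.iter_batches(batch_size=65536)):
        batch.to_pandas().to_csv(f, header=(i == 0), index=False)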
2 Answers
from pyspark.sql import SparkSession

# Initialise a local SparkSession with more executor memory and cores
spark = SparkSession.builder.master('local').appName('myAppName') \
    .config('spark.executor.memory', '4gb').config('spark.cores.max', '6') \
    .getOrCreate()

# Read the Parquet file into a Spark DataFrame
df = spark.read.parquet('ParquetFile.parquet')
I have increased the executor memory and cores here. Please try the same; afterwards you can convert the DataFrame to CSV.
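A minimal sketch of that last step, assuming an output directory named 'csv_output' (not part of the original answer): Spark writes CSV as a directory of part files, one per partition.
# Write the DataFrame out as CSV (one part file per partition, header included)
df.write.csv('csv_output', header=True)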

Prathik Kini
- Getting this error: py4j.protocol.Py4JJavaError: An error occurred while calling o56.parquet. : java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String – Rahul Jun 24 '19 at 10:40
You could do this with Dask (https://dask.org/), which can work with larger-than-memory data on your local machine.
Example code to read a Parquet file and save it again as CSV:
import dask.dataframe as dd
# Lazily load the Parquet file as a Dask DataFrame of many partitions
df = dd.read_parquet('path/to/file.parquet')
# Write each partition to its own CSV; '*' is replaced by the partition number
df.to_csv('path/to/new_files-*.csv')
This will create a collection of CSV files (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single CSV file, see this answer on writing Dask partitions into a single file (e.g. by concatenating them afterwards).
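Alternatively, recent Dask versions support a single_file=True flag on to_csv that writes one file directly; this is an addition to the original answer, so check that your installed version supports it.
# Write all partitions into a single CSV file (needs a reasonably recent Dask)
df.to_csv('path/to/single_file.csv', single_file=True)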

joris