I am trying to convert a large Parquet file into CSV. Since my RAM is only 8 GB, I get a memory error. Is there any way to read the Parquet file into multiple dataframes over a loop?
- Surely if one partitions appropriately this can work? As stuff spills to disk... – thebluephantom Jun 18 '19 at 11:36
- Sorry, I couldn't understand your comment. Can you explain this? @thebluephantom – Rahul Jun 24 '19 at 10:32
- This can be done using the pyarrow library. You can read the data in batches, read certain row groups, or read certain columns only. https://stackoverflow.com/a/74163397/6563567 – ns15 Oct 22 '22 at 16:31
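Building on the pyarrow suggestion in the comment above, here is a minimal sketch of reading the file in batches and appending each batch to a CSV. ParquetFile.iter_batches streams one batch at a time; the batch size and the 'output.csv' name are illustrative assumptions, not from the original thread.
import pyarrow.parquet as pq

pf = pq.ParquetFile('ParquetFile.parquet')
# Stream batches so only one batch is held in memory at a time;
# write the CSV header only for the first batch.
with open('output.csv', 'w', newline='') as f:
    for i, batch in enumerate(pf.iter_batches(batch_size=65536)):
        batch.to_pandas().to_csv(f, header=(i == 0), index=False)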
2 Answers
from pyspark.sql import SparkSession

# Initialise a local SparkSession with more executor memory and cores
spark = SparkSession.builder.master('local').appName('myAppName') \
    .config('spark.executor.memory', '4gb').config('spark.cores.max', '6') \
    .getOrCreate()

# Read the Parquet file into a Spark DataFrame
df = spark.read.parquet('ParquetFile.parquet')
I have increased the executor memory and cores here. Please try the same; afterwards you can convert the DataFrame to CSV.
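A minimal sketch of that last step, assuming an output directory named 'csv_output' (not part of the original answer): Spark writes CSV as a directory of part files, one per partition.
# Write the DataFrame out as CSV (one part file per partition, header included)
df.write.csv('csv_output', header=True)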

Prathik Kini
- Getting this error: py4j.protocol.Py4JJavaError: An error occurred while calling o56.parquet. : java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String – Rahul Jun 24 '19 at 10:40
You could do this with Dask (https://dask.org/), which can work with larger-than-memory data on your local machine.
Example code to read a Parquet file and save it again as CSV:
import dask.dataframe as dd
# Lazily load the Parquet file as a Dask DataFrame of many partitions
df = dd.read_parquet('path/to/file.parquet')
# Write each partition to its own CSV; '*' is replaced by the partition number
df.to_csv('path/to/new_files-*.csv')
This will create a collection of CSV files (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single CSV file, see this answer on writing Dask partitions into a single file (e.g. by concatenating them afterwards).
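Alternatively, recent Dask versions support a single_file=True flag on to_csv that writes one file directly; this is an addition to the original answer, so check that your installed version supports it.
# Write all partitions into a single CSV file (needs a reasonably recent Dask)
df.to_csv('path/to/single_file.csv', single_file=True)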

joris