
I am trying to break a large DataFrame (7 million records) into multiple CSV files of 800k records each. Below is the complete code:

from pyspark.sql import SparkSession

options = {
    "pathGlobalFilter": "*.csv",
    "header": "True",
}

spark = SparkSession.builder.config("spark.driver.host","localhost").appName("CSV Reader").getOrCreate()

csv_path = "C:\\Users\\rajat.kapoor\\Desktop\\155_Lacs_Raw_Data\\OutputFiles_CSV"

RawData_Combined_Revolt_Only = spark.read.format("csv").options(**options).load(csv_path)

# Import necessary libraries
from pyspark.sql.functions import monotonically_increasing_id

# Define the chunk size
chunk_size = 800000

# Add a unique ID column to the dataframe
RawData_Combined_Revolt_Only = RawData_Combined_Revolt_Only.withColumn("id", monotonically_increasing_id())

# Repartition the dataframe based on the chunk size
RawData_Combined_Revolt_Only = RawData_Combined_Revolt_Only.repartition((RawData_Combined_Revolt_Only.count() / chunk_size) + 1)

# Write each partition to a separate CSV file
RawData_Combined_Revolt_Only.write.csv("C:\\Users\\rajat.kapoor\\Desktop\\Output PySpark Folder", header=True, mode="overwrite")

It is giving a Py4JJavaError, with the main error message shown in an attached screenshot (not reproduced here).

2 Answers

Solved it by looking at the article Spark 1.6 - Failed to locate the winutils binary in the hadoop binary path. The shell searches for winutils.exe inside HADOOP_HOME/bin by default, but I had set HADOOP_HOME to the bin directory itself, so it was looking inside HADOOP_HOME/bin/bin and throwing the error.
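
For reference, a minimal sketch of the fix in code, assuming a hypothetical Hadoop install at C:\hadoop with winutils.exe at C:\hadoop\bin\winutils.exe; the key point is that HADOOP_HOME must point at the folder that contains bin, not at bin itself:

import os

# HADOOP_HOME is the install root; Spark appends \bin itself when locating winutils.exe
os.environ["HADOOP_HOME"] = "C:\\hadoop"  # hypothetical install location
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")

Set these before building the SparkSession so the JVM picks them up.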

RajatK350

Reason for the error:

The following expression returns a value of type float, while repartition() expects an int:

(RawData_Combined_Revolt_Only.count() / chunk_size)

Wrap the expression in the int() function:

int(RawData_Combined_Revolt_Only.count() / chunk_size) + 1
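
Applied to the original code, the corrected call would look like the sketch below; math.ceil is an equivalent way to round up, and it avoids creating one extra partition when the count is an exact multiple of chunk_size:

import math

# Round the partition count up so every record is covered
num_partitions = math.ceil(RawData_Combined_Revolt_Only.count() / chunk_size)
RawData_Combined_Revolt_Only = RawData_Combined_Revolt_Only.repartition(num_partitions)

Note that repartition() only yields roughly equal partitions; if a hard cap of 800k records per file is required, the DataFrameWriter option .option("maxRecordsPerFile", chunk_size) limits rows per output file directly.
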
arudsekaberne