I'm facing a memory issue that I'm unable to solve; any help is highly appreciated. I am new to Spark and PySpark and am trying to read a large JSON file (around 5 GB) and load it into a DataFrame using
df = spark.read.json("example.json")
Every time I run the above statement, I get the following error:
java.lang.OutOfMemoryError: Java heap space
I need to get the JSON data into a DataFrame and then use Spark SQL to manipulate and analyse it, but I get the error at the very first step (reading the JSON). I am aware that reading such large files requires changes to the Spark Session configuration. I followed the answers given at Apache Spark: Job aborted due to stage failure: "TID x failed for unknown reasons" and Spark java.lang.OutOfMemoryError: Java heap space
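For context, this is roughly what I intend to do once the read succeeds (the column name below is only a placeholder, not a real field in my data):
# assuming df = spark.read.json("example.json") has succeeded
df.createOrReplaceTempView("records")
# "some_field" is a placeholder column name for illustration only
result = spark.sql("SELECT some_field, COUNT(*) AS cnt FROM records GROUP BY some_field")
result.show()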
I tried to change my SparkSession's configuration, but I think I may have misunderstood some of the settings. The following is my Spark configuration:
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.memory.fraction", 0.8) \
.config("spark.executor.memory", "14g") \
.config("spark.driver.memory", "12g")\
.config("spark.sql.shuffle.partitions" , "8000") \
.getOrCreate()
Is there any mistake in the values I have set for the different parameters, such as driver memory and executor memory? Also, do I need to set any other config parameters besides these?