
I'm facing a memory problem that I'm unable to solve, and any help is highly appreciated. I am new to Spark and PySpark, and I am trying to read a large JSON file, around 5 GB in size, and build an RDD from it using

df = spark.read.json("example.json")

Every time I run the above statement, I get the following error:

java.lang.OutOfMemoryError: Java heap space

I need to get the JSON data in the form of an RDD and then use Spark SQL to manipulate and analyse it, but I get an error at the very first step (reading the JSON). I am aware that reading such large files requires changes to the SparkSession configuration. I followed the answers given at Apache Spark: Job aborted due to stage failure: "TID x failed for unknown reasons" and Spark java.lang.OutOfMemoryError: Java heap space.

I tried to change my SparkSession's configuration, but I think I may have misunderstood some of the settings. The following is my Spark configuration.

spark = SparkSession \
  .builder \
  .appName("Python Spark SQL basic example") \
  .config("spark.memory.fraction", 0.8) \
  .config("spark.executor.memory", "14g") \
  .config("spark.driver.memory", "12g") \
  .config("spark.sql.shuffle.partitions", "8000") \
  .getOrCreate()

Is there any mistake in the values that I have set for parameters like driver memory and executor memory? Also, do I need to set any config parameters other than these?

Jenny
  • Are you calling `.collect()` or `.show()` over that DataFrame? – Gocht May 22 '18 at 20:49
  • I want to, once I create the RDD df. – Jenny May 22 '18 at 20:58
  • Try placing your JSON file in HDFS and reading it from there. – Gocht May 22 '18 at 22:57
  • What is the action that you are calling after reading the JSON file? If you aren't calling one, when does the error occur? Is it a plain text file or compressed? Also, please try without setting `spark.memory.fraction` and [edit] the details into your question. Thanks. – philantrovert May 23 '18 at 08:31
  • If you can split the raw JSON file (hopefully it's an array), do so, place the splits in a folder, and read the folder. Hopefully you have a tool to split a file that big; I only had to split a 500 MB file, so I used Python. – eltbus Apr 05 '21 at 13:58

1 Answer


Try to use:

df = spark.read.json("example.json").repartition(100)

This happens because the data is shuffled across too many small partitions, and the memory overhead of holding all of those partitions ends up in heap memory.
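
For reference, a hedged sketch of how you could inspect the partitioning once the read succeeds; the file name, the repartition count of 100 and the 8000 shuffle partitions all come from the snippets in this question and answer, nothing else is assumed:

# Sketch: inspect partitioning after the suggested repartition
df = spark.read.json("example.json").repartition(100)

print(df.rdd.getNumPartitions())              # should now report 100
print(spark.sparkContext.defaultParallelism)  # baseline parallelism of the session

# Any shuffle (join, groupBy, SQL aggregation) will then create
# spark.sql.shuffle.partitions output partitions, so a value of 8000 means
# thousands of tiny partitions and extra per-partition overhead on the heap.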

My suggestion is to reduce the spark.sql.shuffle.partitions value to a minimum and to use repartition (or increased parallelism) to raise the number of partitions of your input/intermediate DataFrames.

spark = SparkSession \
  .builder \
  .appName("Python Spark SQL basic example") \
  .config("spark.memory.fraction", 0.8) \
  .config("spark.executor.memory", "14g") \
  .config("spark.driver.memory", "12g")\
  .config("spark.sql.shuffle.partitions" , "800") \
  .getOrCreate()
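
Putting the two suggestions together, here is a minimal end-to-end sketch; the temporary view name "data" and the COUNT query are illustrative assumptions, and spark.memory.fraction is left at its default as suggested in the comments:

from pyspark.sql import SparkSession

# Tuned session with fewer shuffle partitions
spark = SparkSession \
  .builder \
  .appName("Python Spark SQL basic example") \
  .config("spark.executor.memory", "14g") \
  .config("spark.driver.memory", "12g") \
  .config("spark.sql.shuffle.partitions", "800") \
  .getOrCreate()

# Read the large JSON and spread it over more partitions
df = spark.read.json("example.json").repartition(100)

# Register the DataFrame so it can be queried with Spark SQL,
# which is what the question wants to do after the read
df.createOrReplaceTempView("data")             # "data" is an illustrative name
spark.sql("SELECT COUNT(*) FROM data").show()
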
Shaido
Rahul Gupta