
I am running the following Spark code and it runs out of memory every time. The data size is not huge, but over time it fails with a GC error, which suggests there are too many objects for the garbage collector to handle. Selecting a few columns from a table should not carry much overhead or create that many objects on the heap. Am I creating too many immutable objects by firing a select query? I am not sure why it complains with a GC error.

import org.apache.log4j.LogManager
import org.apache.spark.sql.{DataFrame, SparkSession}

object O {

  def main(args: Array[String]): Unit = {

    val log = LogManager.getRootLogger
    val TGTTBL = "XYZ"
    val outputLocation = "somepath"
    val dql = "select col1, col2, col3 from SOURCETBL where condition"
    val appName = "createDFS"
    val spark = SparkSession.builder()
      .appName(appName)
      .enableHiveSupport()
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .getOrCreate()

    log.warn("Select Query........................")
    log.warn(dql)
    val tblDF = spark.sql(dql)

    // Build a column-definition string like "colN String, ..." from the DataFrame
    // schema, dropping the last three columns (the partition columns).
    def getCols(df: DataFrame): String = {
      val cols = df.columns
      cols
        .map(c => s"$c ${df.schema(s"$c").dataType}")
        .dropRight(3)
        .mkString(",")
        .replace("Type", "")
    }

    val colString = getCols(tblDF)
    log.warn("Create Table def........................")
    log.warn(colString)
    spark.sql(s"drop table if exists $TGTTBL")
    spark.sql(s"Create external table if not exists $TGTTBL ($colString)" +
      s" partitioned by (col1 string, col2 string, col3 string) stored as orc location '$outputLocation'")

    tblDF.write.partitionBy("col1", "col2", "col3").format("orc").mode("overwrite").save(outputLocation)

  }
}


**Error** - `Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded`
Swapnil
  • Are you running this locally or on a cluster? What about memory configurations? – Shankar Jun 14 '17 at 09:47
  • What about using HiveContext? – Raktotpal Bordoloi Jun 14 '17 at 10:18
  • Does this answer help you? https://stackoverflow.com/a/22742982/6059889 – Steffen Schmitz Jun 14 '17 at 10:44
  • @Shankar - I am running on a cluster. The data is under 1 GB and I am providing enough memory and executors to run the job. – Swapnil Jun 14 '17 at 16:09
  • @RaktotpalBordoloi - What is the advantage of using HiveContext over Hive support in a SparkSession? – Swapnil Jun 14 '17 at 16:11
  • 1
    @SteffenSchmitz - None of the scenarios described fits my case. I read it earlier. I will try to use hive context but not sure how will it be any different from what i am using. – Swapnil Jun 14 '17 at 16:17
  • I kept getting the above error while trying to read through spark sql. What I did was to read the files as RDDs and then create a DF from it. Not sure why it throws GC error while trying to create DF from Spark sql. – Swapnil Jun 20 '17 at 00:00

1 Answer


I kept getting the above error while trying to read through Spark SQL, so I created an RDD[Object] from the input files and converted it to a DataFrame using the rdd.toDF() method. This solved my problem.
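For illustration, here is a minimal sketch of that workaround. It assumes the input is comma-delimited text at a placeholder inputPath and that each row maps onto a simple hypothetical Record case class; the class, paths, and parsing logic are stand-ins, not the original code.

import org.apache.spark.sql.SparkSession

// Hypothetical record type standing in for the real row structure.
case class Record(col1: String, col2: String, col3: String)

object RddToDfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rddToDF")
      .enableHiveSupport()
      .getOrCreate()

    // Needed for the rdd.toDF() implicit conversion.
    import spark.implicits._

    val inputPath = "somepath/input"        // placeholder path
    val outputLocation = "somepath/output"  // placeholder path

    // Read the raw files as an RDD and parse each line into a Record.
    val rdd = spark.sparkContext
      .textFile(inputPath)
      .map(_.split(","))
      .filter(_.length >= 3)
      .map(a => Record(a(0), a(1), a(2)))

    // Convert the RDD to a DataFrame and write it out partitioned, as before.
    val df = rdd.toDF()
    df.write
      .partitionBy("col1", "col2", "col3")
      .format("orc")
      .mode("overwrite")
      .save(outputLocation)
  }
}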

Swapnil