
My Spark job connects to a PostgreSQL database, reads the data from PostgreSQL, and performs aggregations on it. I am able to establish the database connection successfully, but while selecting rows from a table I am facing

ERROR : java.lang.OutOfMemoryError: Java heap space

And

ERROR : java.lang.OutOfMemoryError: GC overhead limit exceeded

To resolve the heap space issue I added the config below to the spark-defaults.conf file. This works fine:

spark.driver.memory 1g

To solve the GC overhead limit exceeded issue I added the config below:

spark.executor.memory 1g
spark.executor.extraJavaOptions -Xmx1024m
spark.driver.maxResultSize 2g
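
As a side note, Spark does not allow the heap size to be set through `spark.executor.extraJavaOptions` (it rejects `-Xmx` there); the executor heap is controlled by `spark.executor.memory` alone, and `extraJavaOptions` is meant for other JVM flags such as GC settings. A corrected sketch of the settings above (the 2g values are illustrative, not a recommendation):

```
spark.driver.memory 2g
spark.executor.memory 2g
spark.driver.maxResultSize 2g
```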

These configurations didn't fully work; I am still facing the same issue. Along with it I am also getting

PSQL ERROR : org.postgresql.util.PSQLException: Ran out of memory retrieving query results

I am facing these issues when dealing with tables that have a huge number of rows; e.g., the news_mentions table has 4,540,092 records and the table size is 5,476 MB. So the spark-job takes even more time to execute, and it has to finish within seconds.
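
One way to avoid pulling the whole table through a single JDBC connection (which is what typically triggers both the heap errors and the PostgreSQL "ran out of memory retrieving query results" error) is to partition the read and set a cursor fetch size. A minimal sketch under assumptions: `id` is a hypothetical numeric primary-key column on news_mentions, and the lowerBound/upperBound values are illustrative and should really come from `SELECT min(id), max(id)` on the table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Split the read into parallel JDBC partitions instead of one big result set.
# "id", lowerBound and upperBound are assumptions for illustration.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/testdb")
      .option("dbtable", "news_mentions")
      .option("user", "postgres")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "4540092")
      .option("numPartitions", "8")
      # Stream rows in batches via a server-side cursor instead of buffering
      # the entire result set in the PostgreSQL JDBC driver.
      .option("fetchsize", "10000")
      .load())
```

With numPartitions set, Spark issues one bounded query per partition, so no single task has to hold the full 5 GB table in memory.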

Here is my actual code:

from pyspark.sql import SparkSession

# Spark 2.x: SparkSession replaces SQLContext/HiveContext
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

input_media_source = "Times of India"

# Connecting to the news_mentions table
df1 = (spark.read.format("jdbc")
       .option("url", "jdbc:postgresql://localhost:5432/testdb")
       .option("dbtable", "news_mentions")
       .option("user", "postgres")
       .load())
df1.createOrReplaceTempView("news_mentions")
news_mentions_DF = spark.sql("SELECT * FROM news_mentions")

news_mentions_DF.show()

I'm facing the GC overhead limit exceeded error while performing show(). How can I run my pyspark job quickly, with high performance and without any errors?
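
Since the job ultimately aggregates, another option is to push the filter down into PostgreSQL instead of materializing the full table in Spark: the JDBC source accepts a subquery as `dbtable`. A sketch under assumptions: `media_source` is a hypothetical column used here only to illustrate filtering on `input_media_source`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

input_media_source = "Times of India"

# Hypothetical pushdown subquery: PostgreSQL does the filtering, so only
# matching rows cross the JDBC connection. The alias ("AS t") is required.
pushdown_query = (
    "(SELECT * FROM news_mentions "
    "WHERE media_source = '{}') AS t".format(input_media_source)
)

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/testdb")
      .option("dbtable", pushdown_query)
      .option("user", "postgres")
      .load())

df.show()
```

With this shape, show() only fetches the pre-filtered rows rather than forcing a scan of all 4.5 million records through one connection.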

NOTE : I am running my pyspark job using the spark-submit command, without starting any standalone cluster.

My Spark version is 2.2.0, with Python version 3.5.2.

  • Increase the executor memory! That's what you need. – pissall Apr 30 '18 at 11:08
  • To what extent should I increase it? Currently I have set that configuration to 2 GB. @pissall – Jaya Sree Meruga Apr 30 '18 at 11:16
  • on my local system, I set it to 6g. I haven't used a cluster to be honest, but please check out the documentation. – pissall Apr 30 '18 at 11:17
  • This is perhaps the most certain sign that the JVM has insufficient memory and it's tried its best to get rid of garbage with little success. If you're fetching from a large table, then you know that it's too large for your spark system. If there's more available memory than you've given Java, then look into setting memory parameters explicitly (such as `--executor-memory` or `--driver-memory`). It could also be that you just need more executors (`--num-executors` may need to be set to a higher number). Using `spark.executor.extraJavaOptionsXmx1024m` is not correct as per the documentation. – ernest_k Apr 30 '18 at 11:30
  • [How to improve performance for slow Spark jobs using DataFrame and JDBC connection?](https://stackoverflow.com/q/32188295) – zero323 Apr 30 '18 at 12:28

0 Answers