My scenario of spark job is to connect to PostgreSQL database, read the data from PSQL and performing aggregations after reading the data. In this process I am able to establish database connection successfully and while selecting rows from a table I am facing
ERROR : java.lang.OutOfMemoryError java heap space
And
ERROR : java.lang.OutOfMemoryError: GC overhead limit exceeded
To resolve heap space issue I have added below config in spark-defaults.conf file. This works fine
spark.driver.memory 1g
In order to solve GC overhead limit exceeded issue I have added below config
spark.executor.memory 1g
spark.executor.extraJavaOptions Xmx1024m
spark.dirver.maxResultSize 2g
These configurations didn't work to 100% still i am facing same issue. Along which I am also getting
PSQL ERROR : org.postgresql.util.psqlexception ran out of memory retrieving query results
I am facing these issues while I am dealing with tables that have huge no.of rows i.e., news_mentions table has 4 540 092 records and size of table is 5 476 MB. SO it is taking even more time to execute spark-job which has to be done within seconds.
Here is my actual code.
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql import DataFrameReader
sc = SparkContext()
sqlContext = SQLContext(sc)
sqlContext = HiveContext(sc)
input_media_source = " Times of India"
# Connecting to news_mentions table
df1 = sqlContext.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/testdb").option("dbtable", "news_mentions").option("user", "postgres").load()
df1.createOrReplaceTempView("news_mentions")
news_mentions_DF = sqlContext.sql("SELECT * FROM news_mentions")
news_mentions_DF.show()
I'm facing GC limit exceeded error while performing show(). How to run my pyspark job quickly with high performance without any errors?
NOTE : I am running my pyspark job using spark-submit command without starting any standalone cluster mode.
My spark version - 2.2.0 with python version - 3.5.2