
I am trying to read a Hive Parquet table and load it into a Pandas DataFrame. I am using PySpark, and my code is as follows:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Configure the application for yarn-client mode
conf = (SparkConf()
        .setAppName("buyclick")
        .setMaster("yarn-client")
        .set("spark.driver.maxResultSize", "10g")
        .set("spark.driver.memory", "4g")
        .set("spark.driver.cores", "4")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops"))

sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

# Read the Hive table and pull the full result to the driver as Pandas
results = sqlContext.sql("select * from buy_click_p")
res_pdf = results.toPandas()

This fails no matter what I change in the conf parameters, and every time it fails with a Java heap space error:

Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: Java heap space

Here is some other information about the environment:

Cloudera CDH version : 5.9.0
Hive version : 1.1.0
Spark Version : 1.6.0
Hive table size : hadoop fs -du -s -h /path/to/hive/table/folder --> 381.6 M  763.2 M

Free memory on box : free -m
             total    used    free  shared  buffers  cached
Mem:         23545   11721   11824      12      258    1773
  • The post below might help: https://stackoverflow.com/questions/47536123/collect-or-topandas-on-a-large-dataframe-in-pyspark-emr – args Aug 27 '18 at 11:42

2 Answers


My original heap space issue is now fixed; it seems my driver memory was not optimal. Setting the driver memory from the PySpark client does not take effect, because the container has already been created by that time, so I had to set it in the Spark environment properties in the CDH Manager console. To do that, I went to Cloudera Manager > Spark > Configuration > Gateway > Advanced, and in "Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf" I added spark.driver.memory=10g, and the Java heap issue was solved. I think this works when you're running your Spark application in yarn-client mode.
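
For reference, the line added through the safety valve ends up in spark-conf/spark-defaults.conf on the gateway hosts and looks like this (10g is what worked for this table; tune it to your own data size):

# spark-conf/spark-defaults.conf, distributed by the CDH gateway
spark.driver.memory=10g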

However, after the Spark job finishes, the application hangs on toPandas. Does anyone have any idea what specific properties need to be set for the conversion of a DataFrame with toPandas?
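
A minimal sketch of one workaround, assuming only part of the table is actually needed (the column name product_id below is hypothetical): since toPandas() funnels every row through the driver, cutting the DataFrame down on the executors first can avoid the hang.

# Hypothetical example: shrink the result before collecting it to the driver
subset = results.select("product_id").limit(100000)
res_pdf = subset.toPandas()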

-Rahul


I had the same issue. After I changed the driver memory, it worked for me. This is what I set in my code:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("something")
         .config("spark.driver.memory", "10G")
         .getOrCreate())

I set it to 10G, but it depends on your environment and how big your cluster is.
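
As a side note, in yarn-client mode the driver JVM can already be running before in-code config is applied (as the other answer found), so a safer sketch is to pass the memory at launch time; app.py here is a placeholder for your script:

spark-submit --driver-memory 10g app.py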
