9

In the middle of my project I am getting the error below (a GC overhead exception; screenshot attached in the original post) after invoking a function in my Spark SQL query.

I have written a user-defined function that takes two strings and concatenates them; after concatenation it keeps the rightmost substring of length 5, depending on the total string length (an alternative to SQL Server's `RIGHT(string, integer)`):

    from pyspark.sql.types import StringType


    def concatstring(xstring, ystring):
        # Concatenate the two inputs and keep the rightmost five characters,
        # mimicking SQL Server's RIGHT(string, 5); return '99999' when the
        # combined length is neither 6 nor 7.
        newvalstring = xstring + ystring
        print(newvalstring)
        if len(newvalstring) == 6:
            return newvalstring[1:6]
        elif len(newvalstring) == 7:
            return newvalstring[2:7]
        else:
            return '99999'


    spark.udf.register('rightconcat', lambda x, y: concatstring(x, y), StringType())
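
For reference, a minimal sanity check of the registered UDF might look like the sketch below (assuming an active SparkSession named `spark`; the column names and values are made up purely for illustration):

    from pyspark.sql import Row

    # Two hypothetical rows just to exercise the registered UDF.
    testdf = spark.createDataFrame([Row(x='00000', y='7'), Row(x='00000', y='12')])
    testdf.createOrReplaceTempView('udf_test')

    # Call the UDF the same way the larger query does.
    spark.sql("select rightconcat(x, y) as padded from udf_test").show()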

It works fine individually. Now, when I call it as a column in my Spark SQL query, this exception occurs.


The query I have written is:

spark.sql("select d.BldgID,d.LeaseID,d.SuiteID,coalesce(BLDG.BLDGNAME,('select EmptyDefault from EmptyDefault')) as LeaseBldgName,coalesce(l.OCCPNAME,('select EmptyDefault from EmptyDefault'))as LeaseOccupantName, coalesce(l.DBA, ('select EmptyDefault from EmptyDefault')) as LeaseDBA, coalesce(l.CONTNAME, ('select EmptyDefault from EmptyDefault')) as LeaseContact,coalesce(l.PHONENO1, '')as LeasePhone1,coalesce(l.PHONENO2, '')as LeasePhone2,coalesce(l.NAME, '') as LeaseName,coalesce(l.ADDRESS, '') as LeaseAddress1,coalesce(l.ADDRESS2,'') as LeaseAddress2,coalesce(l.CITY, '')as LeaseCity, coalesce(l.STATE, ('select EmptyDefault from EmptyDefault'))as LeaseState,coalesce(l.ZIPCODE, '')as LeaseZip, coalesce(l.ATTENT, '') as LeaseAttention,coalesce(l.TTYPID, ('select EmptyDefault from EmptyDefault'))as LeaseTenantType,coalesce(TTYP.TTYPNAME, ('select EmptyDefault from EmptyDefault'))as LeaseTenantTypeName,l.OCCPSTAT as LeaseCurrentOccupancyStatus,l.EXECDATE as LeaseExecDate, l.RENTSTRT as LeaseRentStartDate,l.OCCUPNCY as LeaseOccupancyDate,l.BEGINDATE as LeaseBeginDate,l.EXPIR as LeaseExpiryDate,l.VACATE as LeaseVacateDate,coalesce(l.STORECAT, (select EmptyDefault from EmptyDefault)) as LeaseStoreCategory ,rightconcat('00000',cast(coalesce(SCAT.SORTSEQ,99999) as string)) as LeaseStoreCategorySortID from Dim_CMLease_primer d join LEAS l on l.BLDGID=d.BldgID and l.LEASID=d.LeaseID left outer join SUIT on SUIT.BLDGID=l.BLDGID and SUIT.SUITID=l.SUITID left outer join BLDG on BLDG.BLDGID= l.BLDGID left outer join SCAT on SCAT.STORCAT=l.STORECAT left outer join TTYP on TTYP.TTYPID = l.TTYPID").show()

I have uploaded the query and the post-query state here. How can I solve this problem? Kindly guide me.

  • Make sure your `spark.memory.fraction=0.6`. If it is higher than that, you run into garbage collection errors; see https://stackoverflow.com/a/47283211/179014 – asmaier Nov 14 '17 at 10:25

1 Answer

11

The simplest thing to try would be increasing the Spark executor memory: `spark.executor.memory=6g`.
Make sure you're using all the available memory; you can check that in the Spark UI.
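
As a sketch, the memory settings can be supplied when the shell or job is launched; the values below are illustrative and `your_job.py` is a placeholder script name:

    # PySpark shell with larger driver and executor memory (example values)
    bin/pyspark --driver-memory 8g --executor-memory 6g

    # The same flags work for a submitted application
    bin/spark-submit --driver-memory 8g --executor-memory 6g your_job.py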

UPDATE 1

With `--conf spark.executor.extraJavaOptions="Option"` you can pass `-Xmx1024m` as an option.

What are your current `spark.driver.memory` and `spark.executor.memory` values?
Increasing them should resolve the problem.

Bear in mind that, according to the Spark documentation:

Note that it is illegal to set Spark properties or heap size settings with this option. Spark properties should be set using a SparkConf object or the spark-defaults.conf file used with the spark-submit script. Heap size settings can be set with spark.executor.memory.
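
To illustrate what the quoted documentation describes, executor memory could be set through a `SparkConf` object before the session is created. The sketch below uses an example value; note that `spark.driver.memory` generally has to be set on the command line or in `spark-defaults.conf`, because the driver JVM is already running by the time your Python code executes:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Example only: raise executor memory through a SparkConf object,
    # as the documentation quoted above suggests.
    conf = SparkConf().set("spark.executor.memory", "6g")

    spark = SparkSession.builder.config(conf=conf).getOrCreate()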

UPDATE 2

As the GC overhead error is a garbage collection problem, I would also recommend reading this great answer.

  • Thank you for your reply. I have just tried it out, but nothing changes: `ubuntu@tvnubtest:~/spark-2.0.0-bin-hadoop2.7$ bin/pyspark --conf executor.extraClassPath=$SPARK_HOME/lib/sqljdbc4.jar --driver-class-path $SPARK_HOME/lib/sqljdbc4.jar --jars $SPARK_HOME/lib/sqljdbc4.jar --executor-memory 6g` – Kalyan Dec 06 '16 at 10:05
  • Try allowing your JVM more Java heap space by: `java -Xmx1024m com.yourName.yourClass` – Jarek Dec 06 '16 at 11:46
  • If your objects are consuming too much memory, it will allow the JVM to run smoothly – Jarek Dec 06 '16 at 11:48
  • Could I write the above config as a parameter passed to Spark, like the command I posted in the previous comment, or do I need extra housekeeping for that? I am a newbie in Spark, so I am slightly confused. Sorry for asking a silly question. – Kalyan Dec 07 '16 at 04:19
  • However, I'd focus on 3 parameters: `spark.driver.memory=45g`, `spark.executor.memory=6g`, `spark.driver.maxResultSize=8g`. It's just an example of my config that sorted a similar problem out. Play with your config, but first check how much available memory you have in the UI. – Jarek Dec 07 '16 at 09:23
  • Thanks for your reply. I have increased the physical memory to 16G; the problem was solved temporarily, but I am getting the overhead error again. I guess your `java -Xmx1024m` suggestion will be needed now. I am using PySpark and loading the existing table with the JDBC driver on a Linux PC. Could you say which class I need to target (I mean `com.yourName.yourClass`), since I didn't create any class? Sorry for bothering you so much :( – Kalyan Dec 13 '16 at 08:27
  • `--conf spark.executor.extraJavaOptions="Option"` – you can pass `-Xmx1024m` as an option. What are your current `spark.driver.memory` and `spark.executor.memory`? Bear in mind that according to the Spark documentation: "Note that it is illegal to set Spark properties or heap size settings with this option. Spark properties should be set using a SparkConf object or the spark-defaults.conf file used with the spark-submit script. Heap size settings can be set with spark.executor.memory." – Jarek Dec 13 '16 at 11:09
  • Well, I have done some tweaking here. I have increased the PC memory to 32G and have set driver memory to 24G and executor memory to 8G for now, and it has worked so far, I guess; I am not getting the exception. I am currently trying to increase the Java heap size to avoid trouble in the future. Thank you very much for your detailed reply, Jarek. I might knock again if I get an odd exception :P Kidding. – Kalyan Dec 14 '16 at 13:46
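
For reference, the setup described in the last comment (32G of physical memory, 24G driver memory, 8G executor memory) would correspond roughly to a launch like the sketch below, reusing the JDBC jar paths from the earlier comment; treat it as an illustration of that particular machine, not a general recommendation:

    bin/pyspark \
      --driver-memory 24g \
      --executor-memory 8g \
      --driver-class-path $SPARK_HOME/lib/sqljdbc4.jar \
      --jars $SPARK_HOME/lib/sqljdbc4.jar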