
I am new to PySpark. I have been writing my code against a small test sample, but once I run it on the larger file (3 GB compressed) I keep getting errors regarding Py4J. My code only does some filtering and joins.

Any help would be useful, and appreciated.

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

ss = SparkSession \
      .builder \
      .appName("Example") \
      .getOrCreate()

ss.conf.set("spark.sql.execution.arrow.enabled", 'true')

df = ss.read.csv(directory + '/' + filename, header=True, sep=",")
# Some filtering and groupbys...
df.show()

This returns:

Py4JJavaError: An error occurred while calling o88.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
1, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
...
Caused by: java.lang.OutOfMemoryError: Java heap space

UPDATE: I was using Py4J 10.7 and just updated to 10.8.

UPDATE(1): Adding spark.driver.memory:

ss = SparkSession \
      .builder \
      .appName("Example") \
      .config("spark.driver.memory", "16g") \
      .getOrCreate()
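
Note: as far as I understand, spark.driver.memory is read when the driver JVM launches, so it only takes effect if it is set before any SparkSession or SparkContext exists in the Python process (I restarted my kernel to be sure). A minimal sketch to verify the value actually reached the JVM:

from pyspark.sql import SparkSession

# Must run before any SparkSession/SparkContext has been created,
# since spark.driver.memory is consumed at JVM launch.
ss = SparkSession \
      .builder \
      .appName("Example") \
      .config("spark.driver.memory", "16g") \
      .getOrCreate()

# Confirm the setting took effect.
print(ss.sparkContext.getConf().get("spark.driver.memory"))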

Summarized error output:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:38004)

py4j.protocol.Py4JNetworkError: Answer from Java side is empty
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

Py4JError: An error occurred while calling o94.showString

UPDATE(2): I tried changing the spark-defaults.conf file, as suggested in PySpark: java.lang.OutOfMemoryError: Java heap space, but I am still getting the same error.
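
For reference, another way to pass the driver memory before the JVM launches is through the PYSPARK_SUBMIT_ARGS environment variable; a sketch (the 16g value is just what I was testing with):

import os

# Must be set before the SparkContext is created;
# "pyspark-shell" has to be the last token.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 16g pyspark-shell"

from pyspark.sql import SparkSession

ss = SparkSession \
      .builder \
      .appName("Example") \
      .getOrCreate()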

SEMI-SOLVED: This seems to have been a general memory problem. I started a 2xlarge instance with 32 GB of memory, and the program runs with no errors.

Knowing this, is there some other conf option that could help so I don't have to run such an expensive instance?
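
For anyone who hits the same wall, these are the kinds of settings I plan to experiment with on a smaller instance; a sketch with illustrative, untuned values:

from pyspark.sql import SparkSession

ss = SparkSession \
      .builder \
      .appName("Example") \
      .config("spark.driver.memory", "8g") \
      .config("spark.sql.shuffle.partitions", "400") \
      .getOrCreate()

path = "data.csv.gz"  # placeholder for the real file

# More, smaller partitions can keep each task's working set
# within the available heap during the joins and groupbys.
df = ss.read.csv(path, header=True, sep=",") \
       .repartition(200)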

Thanks, everyone.

  • How much memory has been allocated to the Driver? – HarryClifton Feb 06 '19 at 09:38
  • @SurajRamesh I am using an AWS cloud instance. I have used .config("spark.executor.memory", "16g") and it didn't make a difference. – TChi Feb 06 '19 at 14:36
  • Try setting `spark.driver.memory` to `16g`: `.config("spark.driver.memory", "16g")`. Does your code work for smaller datasets? – GeneticsGuy Feb 06 '19 at 15:03
  • @GeneticsGuy I took your advice and got a different error: Py4JError: An error occurred while calling o94.showString – TChi Feb 06 '19 at 15:40
  • You may have to post the filtering and groupby methods you are using. Spark's lazy evaluation leads to error messages being shown for the last method when earlier methods are the cause (see the sketch after these comments). – GeneticsGuy Feb 06 '19 at 15:52
  • You could try allocating more memory to the JVM by increasing the Java heap memory, and then reducing driver memory to see if you can run your application on a smaller instance. – HarryClifton Feb 07 '19 at 02:37
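
Following up on the lazy-evaluation point above, a sketch of how the failing step can be isolated by forcing each intermediate DataFrame with an action; df is the DataFrame from the question, and "amount" and "key" are placeholder names, not the real columns:

# Force each intermediate result so the error surfaces at the step
# that actually causes it, instead of at the final show().
filtered = df.filter(df["amount"] > 0)
filtered.count()  # runs only the read + filter

grouped = filtered.groupBy("key").count()
grouped.count()   # runs the shuffle behind the groupBy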

2 Answers


This is a known issue with pyspark 2.4.0 installed via conda. You'll want to downgrade to pyspark 2.3.0 from the conda prompt or a Linux terminal:

    conda install pyspark=2.3.0

You may not have the right permissions.

I had the same problem when using the Docker image jupyter/pyspark-notebook to run an example PySpark script, and it was solved by running as root inside the container.

Anyone else using that image can find some tips here.
