
I'm trying to run this code:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("Word Count") \
        .getOrCreate()

df = spark.createDataFrame([
    (1, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
    (4, 144.5, 5.9, 33, 'M'),
    (5, 133.2, 5.7, 54, 'F'),
    (3, 124.1, 5.2, 23, 'F'),
    (5, 129.2, 5.3, 42, 'M'),
   ], ['id', 'weight', 'height', 'age', 'gender'])

df.show()
print('Count of Rows: {0}'.format(df.count()))
print('Count of distinct Rows: {0}'.format((df.distinct().count())))

spark.stop()

and I'm getting this error:

18/06/22 11:58:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
    ...
Exception: Java gateway process exited before sending its port number

I'm using PyCharm on macOS, with Python 3.6 and Spark 2.3.1.

What is the possible reason for this error?

Alper t. Turker
bboy

7 Answers

16

This error is a result of a version mismatch. The environment variable referenced in the traceback (_PYSPARK_DRIVER_CALLBACK_HOST) was removed when the Py4j dependency was updated to 0.10.7, a change that was backported to the 2.3 branch in 2.3.1.

Considering version information:

I'm using PyCharm on macOS, with Python 3.6 and Spark 2.3.1.

it looks like you have the 2.3.1 package installed, but SPARK_HOME points to an older (2.3.0 or earlier) installation.
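
A quick way to check for exactly this mismatch, as a sketch (it assumes the pyspark package is importable and that the Spark distribution ships the standard RELEASE file at its root):

import os
import pyspark

# Version of the pip-installed pyspark package
print("pyspark package version:", pyspark.__version__)
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))

# Binary Spark distributions include a RELEASE file whose first line
# names the version that SPARK_HOME actually points to.
release = os.path.join(os.environ.get("SPARK_HOME", ""), "RELEASE")
if os.path.exists(release):
    with open(release) as f:
        print("SPARK_HOME version:", f.readline().strip())

The two versions printed should match (2.3.1 in this case).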

Alper t. Turker
  • Thanks, after updating my Spark version to 2.3.1 it works fine. – bboy Jun 22 '18 at 19:50
  • Upgrading from 2.3 to 2.3.1 worked for me too. Thanks! – keberwein Aug 01 '18 at 16:49
  • @user8371915 I'm using Python 3.5 and Spark 2.1.0 and getting the same error "java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST". My .bashrc has: export PYSPARK_PYTHON=/usr/bin/python3.5, export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5, export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.7, export SCALA_HOME=/opt/scala-2.11.8, export HADOOP_HOME=/opt/hadoop-2.7.7, export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin, export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH. Any ideas? – user3456269 Sep 26 '18 at 23:08
13

This resolution also takes care of the "key not found: _PYSPARK_DRIVER_CALLBACK_HOST" / Java gateway error with PySpark 2.3.1. Add the following to your .bashrc, /etc/environment, or /etc/profile:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

That should do it. You may thank me in advance. #thumbsup :)
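
One caveat, not from the answer itself: the py4j zip name varies by Spark release (0.8.2.1 ships with quite old versions), so the path above must match the file actually present under $SPARK_HOME/python/lib. A minimal sketch for finding it without hard-coding the version (assumes SPARK_HOME is exported):

import glob
import os

spark_home = os.environ["SPARK_HOME"]
# These are the two entries the exports above add to PYTHONPATH.
print(os.path.join(spark_home, "python"))
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    print(zip_path)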

SCOTT McNEIL
2

I got the same key not found: _PYSPARK_DRIVER_CALLBACK_HOST error while upgrading to Spark 3.1.1.

What worked for me was upgrading pyspark via pip install pyspark==3.1.1, installing findspark, and then running the following lines before starting the SparkSession:

import findspark
findspark.init()

If you have multiple versions of Spark on the system, point it at the right installation explicitly:

import findspark
findspark.init(r"C:\spark-2.3.0")  # raw string avoids backslash-escape issues on Windows
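
Putting it together with the question's code, the sequence would look roughly like this; the important part is that findspark.init() runs before the first pyspark import:

import findspark
findspark.init()  # or findspark.init(r"C:\spark-2.3.0") for a specific install

from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("Word Count") \
        .getOrCreate()
print(spark.version)  # should match the pyspark package version
spark.stop()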
vaquar khan
0

The environment variables set in .bash_profile or /etc/profile may not be visible to your code, so set them directly in the code:

import os
import sys

# Point at the Spark installation and tell PySpark how to launch the JVM.
os.environ['SPARK_HOME'] = "/opt/cloudera/parcels/SPARK2/lib/spark2"
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master yarn pyspark-shell"

# Make the PySpark and Py4j libraries bundled with that installation importable.
sys.path.append(os.path.join(os.environ['SPARK_HOME'], "python"))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], "python/lib/py4j-0.10.6-src.zip"))

try:
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark import SparkConf

    print("success")
except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)

xfly
  • While your answer goes in the direction of the question, I suggest writing more in-code comments explaining why the error appeared. – giosans Jan 15 '19 at 13:21
0

I got similar errors: java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST and Exception: Java gateway process exited before sending its port number.

Running export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH, or adding it to .bashrc, resolved the issue.

Please also check whether the MapR credentials are set up.
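
To verify that the export actually reached the interpreter, a small sketch (entries from PYTHONPATH show up on sys.path at startup):

import os
import sys

expected = os.path.join(os.environ.get("SPARK_HOME", ""), "python")
# True if $SPARK_HOME/python made it onto the module search path
print(any(os.path.normpath(p) == os.path.normpath(expected) for p in sys.path))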

0

I had the same issue, and none of the settings above worked for me. I actually had SPARK_HOME set already. It turned out that I had simply installed pyspark with pip install pyspark without verifying the version. After a lot of debugging inside the code, I figured out that

anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py

did not reference _PYSPARK_DRIVER_CALLBACK_HOST at all, whereas older versions of pyspark do (I am using Anaconda, hence this file path; the exact location may differ for others).

I finally concluded that it was due to a version mismatch. It seems obvious in hindsight, but I hope it saves others a lot of debugging time.

The solution is to find out which Spark version is installed (e.g. 2.3.0) and then install the matching pyspark: pip install pyspark==2.3.0. After this it worked like a charm.

Note: this issue occurred only when I called SparkSession.builder.appName within Python. Both the pyspark and spark-submit commands worked fine even with the version mismatch, which is why it never crossed my mind that the mismatch could be the cause.
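
A way to catch such a mismatch up front is to compare the two versions directly; a sketch, assuming spark-submit is on the PATH:

import subprocess
import pyspark

result = subprocess.run(
    ["spark-submit", "--version"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
)
# spark-submit prints its version banner to stderr
print(result.stderr)
print("pyspark package:", pyspark.__version__)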

niths4u
0
pip install pyspark==2.3.0

(matching the installed Spark version) works.

jizhihaoSAMA