
I am using PySpark and the spark-redshift driver to load data into a Redshift table. Because I read that 'com.databricks.spark.redshift' does not work with Spark 2.3.1, I am using Spark 2.1.0 with Python 3.5.6, as the pyspark shell banner confirms:

Python 3.5.6 (default, Sep 26 2018, 21:49:11)
[GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/09 15:41:25 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/10/09 15:41:25 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/10/09 15:41:26 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.5.6 (default, Sep 26 2018 21:49:11)
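
The load itself is just a spark-redshift write along these lines (the JDBC URL, table name, and S3 tempdir below are placeholders, and df stands in for my real DataFrame):

# df is the DataFrame I want to load; all connection details are placeholders
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3n://<my-bucket>/tmp/") \
    .mode("error") \
    .save()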

When I try to create my SparkContext from my script

sc = SparkContext(conf=conf)
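
For context, the surrounding setup is roughly the following (the app name and master are placeholders for my actual configuration):

from pyspark import SparkConf, SparkContext

# Placeholder conf; my real script also sets the Redshift/S3-related options here
conf = SparkConf() \
    .setAppName("redshift-load") \
    .setMaster("local[*]")

sc = SparkContext(conf=conf)  # this is the line that fails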

I get the following error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/09 15:34:46 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1228)
    at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
    at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Based on this post, it sounds like the error is related to a version mismatch between Spark and PySpark. I don't believe I have a version mismatch, though:

[ec2-user@ip-172-31-50-110 ~]$ echo $SPARK_HOME
/opt/spark-2.1.0-bin-hadoop2.7
[ec2-user@ip-172-31-50-110 ~]$ echo $PYSPARK_PYTHON
/usr/bin/python3.5
[ec2-user@ip-172-31-50-110 ~]$ echo $PYTHONPATH
/usr/bin/python3.5

I'm not sure where to go from here in diagnosing the issue. Any help is greatly appreciated!

EDIT: Output from `spark-shell --version`:

[ec2-user@ip-172-31-50-110 ~]$ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_181
Branch
Compiled by user jenkins on 2016-12-16T02:04:48Z
Revision
Url
Type --help for more information.

And output from `$PYSPARK_PYTHON -c "import pyspark; print(pyspark.__version__)"`:

[ec2-user@ip-172-31-50-110 ~]$ $PYSPARK_PYTHON -c "import pyspark; print(pyspark.__version__)"
2.3.2
  • Please include output from `spark-shell --version` and `$PYSPARK_PYTHON -c "import pyspark; print(pyspark.__version__)"` – zero323 Oct 09 '18 at 16:14
  • Thanks @user6910411! I've edited the above to show both. It appears they are different versions - Any advice on how I fix? – user3456269 Oct 09 '18 at 17:22
  • Sounds like you've installed PySpark in your environment (likely using `pip` or a similar tool). Typically such tools come with corresponding uninstall option (like `pip uninstall pyspark` or `conda uninstall pyspark`) :) If you have problem with determining the location try `$PYSPARK_PYTHON -c "import pyspark; print(pyspark.__file__)"` – zero323 Oct 09 '18 at 17:27
  • Thanks! @user6910411 I uninstalled that version and reinstalled using `python3.5 -m pip install -Iv pyspark==2.1.3`. I now have `[ec2-user@ip-172-31-50-110 ~]$ $PYSPARK_PYTHON -c "import pyspark; print(pyspark.__version__)" 2.1.3`, but am getting the same error. Do the versions need to match exactly (right now it's Spark 2.1.0 and PySpark 2.1.3), or am I missing something else? – user3456269 Oct 09 '18 at 18:36
  • Theoretically minor versions shouldn't make a difference, though as you can see on the linked example, that's not always the case. However if the error is still the same it suggests you still have 2.3.1+ somewhere on the path. What about `PYSPARK_DRIVER_PYTHON`? – zero323 Oct 09 '18 at 19:19
  • 1
    Update: I ended up reinstalling Spark as versions 2.1.3 and that fixed the issue. I guess in this case minor versions did make a difference. Thanks again for all the help @user6910411! – user3456269 Oct 09 '18 at 19:37

0 Answers