
I'm using the following command to execute a PySpark script:

spark-submit \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 \
  pyspark_streaming/spark_streaming.py

The spark_streaming.py script is as follows:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

if __name__ == "__main__":
    # Local Spark context with a 10-second streaming batch interval
    sc = SparkContext('local[*]')
    ssc = StreamingContext(sc, 10)
    # Receiver-based Kafka stream: ZooKeeper quorum, consumer group, {topic: partitions}
    kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {'twitter': 1})
    # Each Kafka record is a (key, value) pair; the tweet JSON is in the value
    parsed = kafkaStream.map(lambda v: json.loads(v[1]))
    # Count tweets per screen name within each batch
    user_counts = parsed.map(lambda tweet: (tweet['user']['screen_name'], 1)).reduceByKey(lambda x, y: x + y)
    user_counts.pprint()
    ssc.start()
    ssc.awaitTermination()

That gives the following error:

ubuntu@ubuntu:/mnt/vmware_shared_folder$ spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 pyspark_streaming/spark_streaming.py
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/home/ubuntu/spark-2.4.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-kafka-0-8_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e1001739-51dd-43bc-83b7-52a7484328f1;1.0
    confs: [default]
    found org.apache.spark#spark-streaming-kafka-0-8_2.11;2.4.0 in central
    found org.apache.kafka#kafka_2.11;0.8.2.1 in local-m2-cache
    found org.scala-lang.modules#scala-xml_2.11;1.0.2 in local-m2-cache
    found com.yammer.metrics#metrics-core;2.2.0 in local-m2-cache
    found org.slf4j#slf4j-api;1.7.16 in local-m2-cache
    found org.scala-lang.modules#scala-parser-combinators_2.11;1.1.0 in central
    found com.101tec#zkclient;0.3 in local-m2-cache
    found log4j#log4j;1.2.17 in local-m2-cache
    found org.apache.kafka#kafka-clients;0.8.2.1 in local-m2-cache
    found net.jpountz.lz4#lz4;1.2.0 in local-m2-cache
    found org.xerial.snappy#snappy-java;1.1.7.1 in local-m2-cache
    found org.spark-project.spark#unused;1.0.0 in local-m2-cache
:: resolution report :: resolve 3367ms :: artifacts dl 142ms
    :: modules in use:
    com.101tec#zkclient;0.3 from local-m2-cache in [default]
    com.yammer.metrics#metrics-core;2.2.0 from local-m2-cache in [default]
    log4j#log4j;1.2.17 from local-m2-cache in [default]
    net.jpountz.lz4#lz4;1.2.0 from local-m2-cache in [default]
    org.apache.kafka#kafka-clients;0.8.2.1 from local-m2-cache in [default]
    org.apache.kafka#kafka_2.11;0.8.2.1 from local-m2-cache in [default]
    org.apache.spark#spark-streaming-kafka-0-8_2.11;2.4.0 from central in [default]
    org.scala-lang.modules#scala-parser-combinators_2.11;1.1.0 from central in [default]
    org.scala-lang.modules#scala-xml_2.11;1.0.2 from local-m2-cache in [default]
    org.slf4j#slf4j-api;1.7.16 from local-m2-cache in [default]
    org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
    org.xerial.snappy#snappy-java;1.1.7.1 from local-m2-cache in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   12  |   0   |   0   |   0   ||   12  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e1001739-51dd-43bc-83b7-52a7484328f1
    confs: [default]
    0 artifacts copied, 12 already retrieved (0kB/88ms)
2019-10-29 04:57:49 WARN  Utils:66 - Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.10.138 instead (on interface ens33)
2019-10-29 04:57:50 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
Traceback (most recent call last):
  File "/mnt/vmware_shared_folder/pyspark_streaming/spark_streaming.py", line 8, in <module>
    from pyspark import SparkContext
ImportError: cannot import name 'SparkContext' from 'pyspark' (/mnt/vmware_shared_folder/pyspark_streaming/pyspark.py)
2019-10-29 04:57:55 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-10-29 04:57:55 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-0ad03060-792a-42cb-81b3-62fa69d66463

I've tried the code from the script in the pyspark shell and it runs fine.
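
Looking at the traceback, pyspark seems to resolve to /mnt/vmware_shared_folder/pyspark_streaming/pyspark.py rather than the installed package. Here is a minimal diagnostic sketch (assuming it is run with the same interpreter that spark-submit launches) to confirm which module actually gets imported:

import sys

# Sketch: print which file Python imports for "pyspark". If it points at a file
# in the script's own folder, a local module is shadowing the real package
# (Python puts the script's directory at the front of sys.path).
import pyspark
print(pyspark.__file__)
print(sys.path[0])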

I've added the following to my .bashrc file:

export SPARK_HOME=/home/ubuntu/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3
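
To double-check that these variables actually reach the driver process, a quick sketch (assuming it is run with the same python3 that spark-submit uses) that prints the relevant environment variables and the module search path:

import os
import sys

# Sketch: confirm SPARK_HOME and PYTHONPATH are visible to the interpreter
# and list the directories Python will search when importing pyspark.
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYTHONPATH"))
print("\n".join(sys.path))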

Any help would be appreciated.

  • See https://stackoverflow.com/questions/43126547/unable-to-import-sparkcontext – thebluephantom Oct 27 '19 at 09:38
  • @thebluephantom tried the recommendations in that question; none work. – g grey Oct 27 '19 at 11:34
  • Can you start `pyspark` and see if you can `from pyspark import SparkContext` followed by `sc = SparkContext('local[*]')`. Could it be `python3`-related? – Jacek Laskowski Oct 27 '19 at 19:56
  • @JacekLaskowski I can run `from pyspark import SparkContext` but `sc = SparkContext('local[*]')` gives me `ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=yarn) created by at /home/ubuntu/spark-2.4.0-bin-hadoop2.7/python/pyspark/shell.py:41 ` – g grey Oct 28 '19 at 10:50
  • That's expected and seems like `pyspark` at least works fine. Can you remove all the code from `spark_streaming.py` and leave just the import and `sc = SparkContext('local[*]')`. Does that `spark-submit` properly? – Jacek Laskowski Oct 28 '19 at 14:08
  • @JacekLaskowski it gives same `ImportError: cannot import name 'SparkContext' ` – g grey Oct 28 '19 at 16:15
  • Can you edit the question and show the exact pyspark app you execute? And the exact command. `pyspark` works but the python app does not, right? – Jacek Laskowski Oct 28 '19 at 16:36
  • Why do you define `PYTHONPATH` env var? What happens without it? It worked for me (but I haven't tested it out much). – Jacek Laskowski Oct 28 '19 at 16:40
  • @JacekLaskowski the error changes to `ImportError: cannot import name 'SparkContext' from 'pyspark' (/pyspark_streaming/pyspark.py)` – g grey Oct 28 '19 at 16:51
  • Can you edit your question with the very basic python script and how you execute it using `spark-submit`? Can you also do `type spark-submit`? Thanks – Jacek Laskowski Oct 29 '19 at 09:30
  • Can you also paste the entire output from `spark-submit` from the beginning to the error? Thanks. – Jacek Laskowski Oct 29 '19 at 09:39
  • @JacekLaskowski added the entire output of spark-submit & the output of `type spark-submit` is `spark-submit is hashed (/home/ubuntu/anaconda3/bin/spark-submit) ` – g grey Oct 29 '19 at 12:10
  • What's this `/home/ubuntu/anaconda3/bin/spark-submit`?! Shouldn't it be `/home/ubuntu/spark-2.4.0-bin-hadoop2.7/bin/spark-submit`? – Jacek Laskowski Oct 29 '19 at 15:59
  • @JacekLaskowski there was some issue with the Spark version setup; I set up the latest version and it now outputs `spark-submit is /home/ubuntu/.local/bin/spark-submit`, but the error persists – g grey Oct 29 '19 at 18:02

0 Answers