I was reading through this post, https://nycdatascience.com/blog/student-works/yelp-recommender-part-2/, and followed basically everything it showed. However, even after reading the post Spark 2.1 Structured Streaming - Using Kakfa as source with Python (pyspark), when I run

SPARK_HOME/bin/spark-submit read_stream_spark.py --master local[4] --jars spark-sql-kafka-0.10_2.11-2.1.0.jar

I still get the error 'Failed to find data source: kafka'.

I also read through https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html. The official docs ask for two hosts and two ports, while I only use one. Should I specify another host and port besides my cloud server and its Kafka port? Thanks.

Could you please let me know what I am missing? Or should I not have run the script on its own?

1 Answer

The official docs ask for two hosts and two ports

That's not related to your error. A minimum of one bootstrap server is required.
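
For example, a single host:port pair works fine. Here is a minimal sketch of the reader (the server address and topic name are placeholders for your own values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_stream_spark").getOrCreate()

# One bootstrap server is enough; the client discovers the rest of the cluster from it
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "your-cloud-server:9092") \
    .option("subscribe", "your-topic") \
    .load()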

You need to move your Python file to the end of the command; otherwise, all of the options you provided are passed as command-line arguments to the Python script rather than to spark-submit, so Spark ends up using the default master with no external jars.
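
For example, your original command reordered so the script comes last:

SPARK_HOME/bin/spark-submit --master local[4] --jars spark-sql-kafka-0.10_2.11-2.1.0.jar read_stream_spark.py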

It's also recommended that you use --packages instead of --jars, since this should ensure transitive dependencies are included with the submission.
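
For example, using the Kafka integration package for Spark 2.1.0 built against Scala 2.11:

SPARK_HOME/bin/spark-submit --master local[4] --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 read_stream_spark.py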
