
I need to work with PySpark to read from and write to MongoDB collections. Everything is working fine. I use the package below to start pyspark with the MongoDB connector:

./pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0

However, the problem is that this runs in the command line, and it becomes tough to write a large amount of code there. Does anyone know how to work in PyCharm with the same functionality, or how to start a PySpark instance in PyCharm with the --packages option?

Abhilash

2 Answers


There has been an extensive SO thread on how to configure PyCharm to work with pyspark - see here.

What that thread does not include is how to add external packages, like the MongoDB connector you are interested in; you can do this by adding the following entry to your spark-defaults.conf file, located in $SPARK_HOME/conf:

spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0

Notice that I am not sure this will work (I suspect not) if you choose to install pyspark with pip (the last option mentioned in the accepted answer of the above thread, for Spark >= 2.2). Personally, I do not recommend installing pyspark with pip since, as mentioned in the docs,

The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster.
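If editing spark-defaults.conf is not an option (e.g. on a shared machine), a hedged alternative sketch: pyspark also reads the PYSPARK_SUBMIT_ARGS environment variable when it starts, so you can set it at the top of your PyCharm script, before any pyspark import. The trailing "pyspark-shell" token is required; the coordinate here mirrors the one from the question.

```python
import os

# Must be set before `import pyspark` / SparkContext creation;
# equivalent in effect to passing --packages on the command line.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell"
)
```

Same caveat as above applies for pip-installed pyspark.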

desertnaut
    Adding `spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0` to `spark-defaults.conf` worked in PyCharm. Thank you very much. – Abhilash Oct 15 '17 at 13:27

Adding mongo-spark-connector to spark.jars.packages in $SPARK_HOME/conf works, as mentioned by @desertnaut. But this configuration can also be set on the Spark session itself; if you are wondering how, here is the code for that in PySpark:

from pyspark.sql import SparkSession

# spark.jars.packages only takes effect if set before the session is
# created, so it must go in the builder. The input/output URIs set the
# default collections for reads and writes.
spark: SparkSession = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/db.collection") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/db.collection") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .master("local") \
    .getOrCreate()
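A small hedged helper for building such URIs, so the database and collection names (placeholders here) are not repeated by hand; the commented lines sketch how the connector is then used for reads and writes via the "mongo" format name it registers:

```python
def mongo_uri(host: str, port: int, database: str, collection: str) -> str:
    """Build a MongoDB connection URI in the form the connector expects."""
    return f"mongodb://{host}:{port}/{database}.{collection}"

uri = mongo_uri("localhost", 27017, "db", "collection")
# With the session above:
# df = spark.read.format("mongo").option("uri", uri).load()
# df.write.format("mongo").mode("append").option("uri", uri).save()
```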
Pardeep