
I have set up PyCharm to link with my local Spark installation as described in this link:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf()
conf.setMaster("spark://localhost:7077")
conf.setAppName("Test")

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [(2012, 8, "Batman", 9.8), (2012, 8, "Hero", 8.7), (2012, 7, "Robot", 5.5), (2011, 7, "Git", 2.0)],
    ["year", "month", "title", "rating"])
df.write.mode('overwrite').format("com.databricks.spark.avro").save("file:///Users/abhattac/PycharmProjects/WordCount/users")

This requires the Databricks spark-avro jar to be shipped to the worker nodes. I can get that done from the shell like the following:

/usr/local/Cellar/apache-spark/1.6.1/bin/pyspark AvroFile.py --packages com.databricks:spark-avro_2.10:2.0.1

I couldn't find out how to provide the `--packages` option when I am running it from inside the PyCharm IDE. Any help will be appreciated.

user3138594

1 Answer


You can use the PYSPARK_SUBMIT_ARGS environment variable, either by setting it in the environment variables section of the PyCharm run configuration (the same place where you set SPARK_HOME),

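For example (a sketch assuming the Spark 1.6.1 and spark-avro 2.0.1 versions from the question), the value entered in that environment variables section would look roughly like this; note the trailing pyspark-shell, which zero323 hints at in the comments below:

PYSPARK_SUBMIT_ARGS=--packages com.databricks:spark-avro_2.10:2.0.1 pyspark-shell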

or by setting it via os.environ directly in your code, as shown in load external libraries inside pyspark code.
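Here is a minimal sketch of the os.environ variant, reusing the package coordinates from the question. The variable has to be set before the SparkContext is created, since that is when the JVM gateway is launched, and again it needs the trailing pyspark-shell:

import os

# Must be set before the SparkContext (and its JVM gateway) is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-avro_2.10:2.0.1 pyspark-shell'
)

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setMaster("spark://localhost:7077").setAppName("Test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# The com.databricks.spark.avro data source should now resolve, e.g.:
# df.write.mode('overwrite').format("com.databricks.spark.avro").save("file:///...")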

zero323
  • Thanks for the answer. I tried the following: `os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-avro_2.10:2.0.1'`. That didn't work and kept giving `java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.avro`. Then I included these two lines in spark-defaults.conf: `spark.driver.extraClassPath /Users/abhattac/PycharmProjects/WordCount/spark-avro_2.10-2.0.1.jar` and `spark.executor.extraClassPath /Users/abhattac/PycharmProjects/WordCount/spark-avro_2.10-2.0.1.jar`. With those lines I could make it work. – user3138594 Mar 17 '16 at 17:46
  • The original stackoverflow link: http://stackoverflow.com/questions/31464845/automatically-including-jars-to-pyspark-classpath – user3138594 Mar 17 '16 at 17:53
  • Missing `pyspark-shell`? – zero323 Mar 17 '16 at 17:59
  • @user3138594 - In Windows, how do I specify the path? And if I need to add multiple jars, how do I add them? – Induraj PR Mar 03 '21 at 16:38