
This page inspired me to try out spark-csv for reading .csv files in PySpark. I found a couple of posts, such as this one, describing how to use spark-csv.

But I am not able to initialize the IPython instance by including either the .jar file or the package extension in the start-up command, the way it can be done through spark-shell.

That is, instead of

ipython notebook --profile=pyspark

I tried out

ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3

but it is not supported.

Please advise.

zero323
KarthikS

2 Answers


You can simply pass it in the PYSPARK_SUBMIT_ARGS environment variable. For example:

export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
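Once the notebook is started with these variables in place, spark-csv should be usable directly. A minimal sketch, assuming the pyspark profile already exposes sqlContext and that data.csv is a placeholder path:

# Minimal sketch: read a CSV with the spark-csv data source.
# Assumes `sqlContext` already exists; "data.csv" is a hypothetical path.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("data.csv"))

df.printSchema()
df.show(5)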

These properties can also be set dynamically in your code before the SparkContext / SparkSession and the corresponding JVM have been started:

import os

packages = "com.databricks:spark-csv_2.11:1.3.0"

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
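If you want to confirm that the setting actually reached the JVM, one option is to inspect the resulting Spark configuration after the context is created. A rough sketch, assuming Spark >= 1.5 (where --packages is mapped to the spark.jars.packages entry) and an arbitrary application name:

# Rough sketch: the SparkContext (and hence the JVM) must be created
# only after PYSPARK_SUBMIT_ARGS has been set.
from pyspark import SparkContext

sc = SparkContext(appName="packages-check")  # hypothetical app name
# --packages is propagated to spark.jars.packages (assumption: Spark >= 1.5)
print(sc.getConf().get("spark.jars.packages", "not set"))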
zero323
  • Wouldn't this override everything that is already in `os.environ["PYSPARK_SUBMIT_ARGS"]`? I think this needs to be mentioned because I've spent a lot of time figuring out what happened – David Arenburg Nov 28 '16 at 08:56
  • This is not working for Kafka. I am still getting the error below: `java.lang.ClassNotFoundException: Failed to find data source: kafka.` Code: `import os; os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 pyspark-shell'` – Hemant Chandurkar Jul 09 '18 at 11:29

I believe you can also add this as a property in your spark-defaults.conf file. So something like:

spark.jars.packages    com.databricks:spark-csv_2.10:1.3.0

This will load the spark-csv library into PySpark every time you launch the driver.
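If you are on Spark 2.x, the same spark.jars.packages property can, I believe, also be set programmatically when the session is built, as long as no JVM has been started yet. A hedged sketch, with a hypothetical application name:

# Hedged sketch: set spark.jars.packages before any SparkContext/SparkSession
# exists, so the dependency can still be resolved at JVM start-up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-example")  # hypothetical app name
         .config("spark.jars.packages",
                 "com.databricks:spark-csv_2.10:1.3.0")
         .getOrCreate())

The trade-off is the same as with PYSPARK_SUBMIT_ARGS: it only has an effect if it runs before the first context is created.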

Obviously zero323's answer is more flexible, because you can add these lines to your PySpark app before you import the PySpark package:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

from pyspark import SparkContext, SparkConf

This way you are only importing the packages you actually need for your script.

Disco4Ever
    If you are running a notebook, this is by far the most portable option: I'm running the all-spark-notebook version, and this unlocks CSV parsing for all three languages at once. – mrArias Nov 21 '16 at 14:02
  • I am trying to import the package mmlspark, using the following in my notebook, but I am getting the error that mmlspark is not found: `import os; import sys; os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages Azure:mmlspark:0.13 pyspark-shell"; import findspark; findspark.add_packages(["Azure:mmlspark:0.13"]); findspark.init()` – Naveenan Jul 17 '18 at 22:00