
I'm running an EMR cluster with Spark on AWS. The Spark version is 1.6.

When running the following command:

proxy = sqlContext.read.load("/user/zeppelin/ProxyRaw.csv", 
                          format="com.databricks.spark.csv", 
                          header="true", 
                          inferSchema="true")

I get the following error:

Py4JJavaError: An error occurred while calling o162.load. : java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)

How can I solve this? I assume I need to add a package, but how do I install it and where?


2 Answers


There are many ways to add packages in Zeppelin:

  1. One of them is to edit the conf/zeppelin-env.sh configuration file and add the package you need (e.g. com.databricks:spark-csv_2.10:1.4.0 in your case) to the submit options, since Zeppelin uses the spark-submit command under the hood:

    export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.4.0"
    
  2. But let's say you don't actually have access to that configuration. You can then use dynamic dependency loading via the %dep interpreter (deprecated):

    %dep
    z.load("com.databricks:spark-csv_2.10:1.4.0")
    

    Note that the %dep paragraph must run before the Spark interpreter starts; if the interpreter is already running, restart it first and then run the %dep paragraph again.

  3. Another way is to add the dependency you need via the interpreter dependency manager, as described in the following link: Dependency Management for Interpreter.
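Whichever method you pick, once the spark-csv package is on the interpreter's classpath the load from the question should work. A minimal PySpark 1.6 sketch (reusing the path from the question):

    # assumes spark-csv has been loaded by one of the methods above
    proxy = (sqlContext.read
             .format("com.databricks.spark.csv")
             .option("header", "true")
             .option("inferSchema", "true")
             .load("/user/zeppelin/ProxyRaw.csv"))
    proxy.printSchema()  # quick check that columns and types were inferred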

eliasah

Well,

First you need to download the spark-csv lib from the Maven repository:

https://mvnrepository.com/artifact/com.databricks/spark-csv_2.10/1.5.0

Check the Scala version that you are using: either 2.10 or 2.11.

When you call spark-shell, spark-submit, or pyspark (or even Zeppelin), you need to add the --jars option with the path to your lib.

Like this:

pyspark --jars /path/to/jar/spark-csv_2.10-1.5.0.jar

Then you can call it as you did above.
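For example, after starting pyspark with the --jars option above, a quick check could look like this (a sketch; it assumes the jar, and any transitive dependencies it needs, are on the classpath):

    # started with: pyspark --jars /path/to/jar/spark-csv_2.10-1.5.0.jar
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/user/zeppelin/ProxyRaw.csv"))
    df.show(5)  # print the first rows to confirm the data source was found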

You can see another related question here: How to add third party java jars for use in pyspark

Thiago Baldim
  • How would you add that for Zeppelin using the approach you are suggesting? – eliasah Nov 03 '16 at 16:34
  • You can add args to your Zeppelin configuration, including --jars, just like on the command line. See here: https://zeppelin.apache.org/docs/latest/interpreter/spark.html#configuration – Thiago Baldim Nov 03 '16 at 16:45
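For Zeppelin specifically, that would mean adding the --jars flag to the submit options in conf/zeppelin-env.sh, e.g. (a sketch reusing the jar path assumed in this answer):

    export SPARK_SUBMIT_OPTIONS="--jars /path/to/jar/spark-csv_2.10-1.5.0.jar"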