
I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:

I started an SSH session with the master node of my cluster, then I input:

pyspark --packages com.databricks:spark-csv_2.11:1.2.0

Then it launched a pyspark shell in which I input:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs://xxxx/foo.csv')
df.show()

And it worked.

My next step is to launch this job from my main machine using the command:

gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py

But here it does not work and I get an error. I think it's because I did not give --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried about 10 different ways to pass it and did not manage.
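
For context, my_job.py does essentially the same read as the interactive session above; a minimal sketch of such a script would look roughly like this (approximate, not the exact file):

    # Rough sketch of my_job.py -- same read as in the interactive pyspark session
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    # 'gs://xxxx/foo.csv' is a placeholder path
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferschema='true') \
        .load('gs://xxxx/foo.csv')
    df.show()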

My questions are:

  1. Was the Databricks CSV library installed after I typed pyspark --packages com.databricks:spark-csv_2.11:1.2.0?
  2. Can I write a line in my job.py to import it?
  3. Or what params should I give to my gcloud command to import it or install it?
sweeeeeet
  • There's a bug in Dataproc where JARs are not being picked up for PySpark jobs. I am looking into an alternative solution. I just wanted to let you know that we're looking at the larger bug and I am seeing if we can ID an interim fix for you as well. :) – James Oct 28 '15 at 00:17
  • hoping for both a workaround and a fix here too, thx @James! we're trying to use dataproc with the cassandra connector from both python and scala – navicore Oct 28 '15 at 14:27

2 Answers


Short Answer

There are quirks in the ordering of arguments where --packages isn't accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:

gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
    --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py

Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.

Long Answer

So, this is actually a different issue from the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; because Dataproc does not explicitly recognize --packages as a special spark-submit-level flag, it passes it after the application arguments, so spark-submit lets --packages fall through as an application argument rather than parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:

# Doesn't work if job.py depends on that package.
spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0

But switching the order of the arguments does work, even though in the pyspark case both orderings work:

# Works with dependencies on that package.
spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py

So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means it's not actually a 100% compatible migration. This might be something to follow up with on the Spark side.

Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:

gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
    --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py

Note that the --properties must come before the my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. Hope that works for you! Note that the equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py.
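
Since --packages is just an alias for that property, passing the property directly with spark-submit's --conf flag should also be equivalent in an SSH session (a sketch based on that aliasing, not a Dataproc-specific feature):

    # Should behave like --packages, since it just sets spark.jars.packages
    spark-submit --conf spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 job.py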

Dennis Huo
  • This helped me, but I am now struggling to register a new repository in addition to my package. I have tried adding ``--properties spark.jars.packages=org.elasticsearch:elasticsearch-hadoop:2.4.0,spark.jars.ivy=http://conjars.org/repo`` but somehow the two forward slashes get converted into one, and the driver errors out via the below. Do you have any thoughts on this error / the proper way to supply a fully qualified url with two forward slashes: ``Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: http:/conjars.org/repo/local`` – aeneaswiener Nov 15 '16 at 22:35

In addition to @Dennis's answer:

Note that if you need to load multiple external packages, you need to specify a custom escape character like so:

--properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1

Note the ^#^ right before the package list. See gcloud topic escaping for more details.
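
For example, plugging that into the submit command from the other answer would look roughly like this (the cluster name, package list, and script name are placeholders):

    gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
        --properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1 \
        my_job.py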

cerisier