
I'm trying to read Avro files in PySpark. I found out from How to read Avro file in PySpark that spark-avro is the best way to do that, but I can't figure out how to install it from their GitHub repo. There's no downloadable jar; do I build it myself? How?

It's Spark 1.6 (pyspark) running on a cluster. I didn't set it up, so I don't know much about the configs, but I have sudo access, so I guess I should be able to install things. The machine doesn't have direct internet access, though, so I need to copy anything it needs over manually.

Thank you.

noobman

2 Answers


You can add spark-avro as a package when running pyspark or spark-submit (see https://github.com/databricks/spark-avro#with-spark-shell-or-spark-submit), but this requires internet access on the driver (the driver will then distribute the files to the executors).
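
For example, a minimal sketch assuming a Spark 1.6 cluster with the default Scala 2.10 build (v2.0.1 is the tag the spark-avro README maps to Spark 1.6; adjust the coordinates to your versions):

./bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1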

If you have no internet access on the driver, you will need to build spark-avro into a fat jar yourself:

git clone https://github.com/databricks/spark-avro.git
cd spark-avro
# If you are using a Spark version other than the newest,
# check out the appropriate tag based on the table in the spark-avro README;
# for example, for Spark 1.6:
# git checkout v2.0.1
./build/sbt assembly

Then test it using the pyspark shell:

./bin/pyspark --jars ~/git/spark-avro/target/scala-2.11/spark-avro-assembly-3.1.0-SNAPSHOT.jar

>>> spark.range(10).write.format("com.databricks.spark.avro").save("/tmp/output")
>>> spark.read.format("com.databricks.spark.avro").load("/tmp/output").show()
+---+
| id|
+---+
|  7|
|  8|
|  9|
|  2|
|  3|
|  4|
|  0|
|  1|
|  5|
|  6|
+---+
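
Note that the shell session above uses the Spark 2.x `spark` entry point. On Spark 1.6 the pyspark shell exposes `sqlContext` instead, so the equivalent test would look roughly like this:

>>> sqlContext.range(10).write.format("com.databricks.spark.avro").save("/tmp/output")
>>> sqlContext.read.format("com.databricks.spark.avro").load("/tmp/output").show()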
Mariusz
  • How do I build a fat jar for spark-avro? – noobman Nov 17 '16 at 06:29
  • Clone the `spark-avro` repository and run `build/sbt assembly` – Mariusz Nov 17 '16 at 06:49
  • It says `Attempting to fetch sbt Our attempt to download sbt locally to build/sbt-launch-0.13.11.jar failed. Please install sbt manually from http://www.scala-sbt.org/` – noobman Nov 17 '16 at 07:07
  • Build the fat jar on a system that is connected to the internet. Then copy the file to the driver and start Spark with `--jars`. – Mariusz Nov 17 '16 at 07:09
  • Stuck at `Getting org.scala-sbt sbt 0.13.11 ... downloading https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar ...` for a long time... my internet is fine otherwise – noobman Nov 17 '16 at 07:40
  • Got the jar and tried to run again but got this `Py4JJavaError: An error occurred while calling o25.load. : java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.avro. Please use Spark package http://spark-packages.org/package/databricks/spark-avro` – noobman Nov 17 '16 at 08:14
  • Are you sure you included `--jars path/to/fat.jar` in the command? – Mariusz Nov 17 '16 at 08:54
  • Yeah, I added the jar from the target folder and that's when the error changed to what I put in my previous comment. Before adding that jar it was some other error. – noobman Nov 19 '16 at 03:35
  • I just tested the solution with spark-2.0.1 and spark-avro from master and edited my comment to give you more details. Please make sure your steps match the instructions. – Mariusz Nov 20 '16 at 18:42
  • I got it to work by building the fat jar on the `branch-2.0` branch because I have an older Spark version. Many thanks for all your help! – noobman Nov 21 '16 at 05:31
  • It's also a good idea to move the created jar file to jars directory `sudo mv /home/hadoop/spark-avro/target/scala-2.11/spark-avro-assembly-4.1.0-SNAPSHOT.jar $SPARK_HOME/jars/` – Paul Bendevis Dec 16 '18 at 07:18

It should be possible with:

wget https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/${SPARK_VERSION}/spark-avro_2.11-${SPARK_VERSION}.jar -P $SPARK_HOME/jars/

echo spark.executor.extraClassPath $SPARK_HOME/jars/spark-avro_2.11-$SPARK_VERSION.jar >> $SPARK_HOME/conf/spark-defaults.conf

echo spark.driver.extraClassPath $SPARK_HOME/jars/spark-avro_2.11-$SPARK_VERSION.jar >> $SPARK_HOME/conf/spark-defaults.conf
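
Note that the org.apache.spark spark-avro artifact only exists for Spark 2.4 and later, where Avro support became a built-in external module, so this approach won't work on older releases such as 1.6 (use the Databricks package from the first answer instead). Assuming Spark 2.4+ with SPARK_VERSION and SPARK_HOME set, a quick sanity check from the pyspark shell using the built-in short format name (reusing the output path from the first answer):

>>> spark.read.format("avro").load("/tmp/output").show()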

ewianda