
I'm using spark-redshift (https://github.com/databricks/spark-redshift), which uses Avro for the data transfer.

Reading from Redshift works fine, but when writing I get:

Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter

I tried Amazon EMR 4.1.0 (Spark 1.5.0) and 4.0.0 (Spark 1.4.1). I cannot do

import org.apache.avro.generic.GenericData.createDatumWriter

either, only

import org.apache.avro.generic.GenericData

I'm using the Scala shell. I tried downloading several other avro-mapred and avro JARs, tried setting

{"classification":"mapred-site","properties":{"mapreduce.job.user.classpath.first":"true"}},{"classification":"spark-env","properties":{"spark.executor.userClassPathFirst":"true","spark.driver.userClassPathFirst":"true"}}

and adding those JARs to the Spark classpath. Possibly Hadoop (EMR) needs to be tuned somehow.
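
For reference, this is roughly how I've been launching the shell and passing those settings (the JAR paths and the spark-redshift version below are placeholders, not my exact setup):

    spark-shell --packages com.databricks:spark-redshift_2.10:0.5.2 \
      --jars /home/hadoop/lib/avro-1.7.7.jar,/home/hadoop/lib/avro-mapred-1.7.7-hadoop2.jar \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true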

Does this ring a bell to anyone?

devopslife

4 Answers


spark-redshift maintainer here.

Other EMR users have encountered similar errors when using newer versions of the spark-avro library (which spark-redshift depends on). In a nutshell, the problem seems to be that EMR's older version of Avro takes precedence over the new version required by spark-avro. At https://github.com/databricks/spark-avro/issues/91, an issue that seems to match the exception reported here, one user suggested embedding the Avro JARs with their application code: https://github.com/databricks/spark-avro/issues/91#issuecomment-142543149
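
As a rough illustration of that approach (a sketch only; the version numbers are assumptions, so match them to the spark-avro release you actually depend on), the idea is to pin Avro in your own build so that your application/assembly JAR ships the newer classes instead of relying on the cluster's copy:

    // build.sbt (illustrative; assumes an sbt-assembly fat JAR)
    libraryDependencies ++= Seq(
      "com.databricks" %% "spark-redshift" % "0.5.2",
      // Pin Avro so the assembly carries the version spark-avro expects,
      // rather than EMR's older Avro from the Hadoop classpath.
      "org.apache.avro" % "avro"        % "1.7.7",
      "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"
    )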

Josh Rosen
  • That's helpful, but I'm not using Java, just Scala (the Scala shell) at the moment. So I'm trying to figure out how to make "spark.driver.userClassPathFirst":"true" work. Any idea how to remove the old Avro JAR from EMR? – devopslife Oct 16 '15 at 13:53
  • Unfortunately, I'm not an EMR user myself. I'd suggest posting this question on the `spark-redshift` thread that I linked, since one of the other users may know how to do this. – Josh Rosen Oct 16 '15 at 17:11
  • Thanks, though it's unfortunate that spark-redshift is so hard to get working on EMR, given that it lives in the same cloud as Redshift. – devopslife Oct 16 '15 at 19:38
  • meaning there is a good chance people use spark-redshift on EMR – devopslife Oct 16 '15 at 21:43
  • @devopslife: I agree. If there is a fix that we can do in `spark-redshift` itself to make it easier to use on EMR, then I'm all for it. I worry, however, that the problem here is not specific to `spark-redshift` but, rather, is an instance of a more general issue with EMR-provided Avro dependencies. – Josh Rosen Oct 17 '15 at 03:49

Jonathan from EMR here. Part of the problem is that Hadoop depends on Avro 1.7.4, and the full Hadoop classpath is included on the Spark classpath on EMR. It might help for us to upgrade Hadoop's Avro dependency to 1.7.7 so that it matches Spark's Avro dependency. I'm a little afraid that might break something else, but I can try it out anyway.
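
If you want to see which Avro JARs end up on that classpath, something like this on the master node should list them (a rough sketch; /usr/lib/hadoop* and /usr/lib/spark are where EMR 4.x installs Hadoop and Spark):

    find /usr/lib/hadoop* /usr/lib/spark -name 'avro*.jar' 2>/dev/null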

BTW, one problem I noticed with your example EMR cluster config is that you're using the "spark-env" configuration classification, whereas "spark-defaults" is the appropriate classification for setting spark.{driver,executor}.userClassPathFirst. I'm not sure that by itself would solve your problem, though.
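
For example, the userClassPathFirst settings from the question would go under the spark-defaults classification, roughly like this (a sketch, using only the property names already mentioned above):

    [
      {"classification":"spark-defaults",
       "properties":{
         "spark.driver.userClassPathFirst":"true",
         "spark.executor.userClassPathFirst":"true"}}
    ]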

Jonathan Kelly

Just for reference, the workaround from Alex Nastetsky:

Delete the Avro JARs from the master node:

find / -name "*avro*jar" 2> /dev/null -print0 | xargs -0 -I file sudo rm file

Delete the Avro JARs from the slave nodes:

yarn node -list | sed 's/ .*//g' | tail -n +3 | sed 's/:.*//g' | xargs -I node ssh node "find / -name '*avro*jar' 2> /dev/null -print0 | xargs -0 -I file sudo rm file"

Setting configs correctly as proposed by Jonathan is worth a shot too.
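
One quick way to check which Avro actually wins after any of these changes is to ask the driver where it loaded GenericData from (a small sketch for the Scala shell):

    // Prints the JAR that GenericData was loaded from on the driver
    println(classOf[org.apache.avro.generic.GenericData]
      .getProtectionDomain.getCodeSource.getLocation)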

devopslife
  • Slight bug in the second command in the `find` clause - should use single quotes inside double quotes. – jbrown Nov 19 '15 at 12:56

A runtime conflict error related to Avro is very common on EMR. Avro is widely used, and a lot of JARs have it as a dependency. I have seen a few variations of this question, each with a different method in the NoSuchMethodError or a different Avro version.

I failed to solve it with the 'spark.executor.userClassPathFirst' flag, because it resulted in a LinkageError.

Here is the solution that resolved the conflict for me:

  1. Use IntelliJ's Dependency Analyzer (Maven plugin) to exclude Avro from all the dependencies that cause the conflict.
  2. When setting up the EMR cluster, add a bootstrap action that calls a bash script to download the specific Avro JAR:

    #!/bin/bash

    mkdir -p /home/hadoop/lib/
    cd /home/hadoop/lib/
    wget http://apache.spd.co.il/avro/avro-1.8.0/java/avro-1.8.0.jar
    
  3. When setting up the EMR cluster, add the following configuration:

    [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.driver.extraLibraryPath": "/home/hadoop/lib/avro-1.8.0.jar:/usr/lib/hadoop/*:/usr/lib/hadoop/../hadoop-hdfs/*:/usr/lib/hadoop/../hadoop-mapreduce/*:/usr/lib/hadoop/../hadoop-yarn/*:/etc/hive/conf:/usr/lib/hadoop/../hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
          "spark.executor.extraClassPath": "/home/hadoop/lib/avro-1.8.0.jar:/usr/lib/hadoop/*:/usr/lib/hadoop/../hadoop-hdfs/*:/usr/lib/hadoop/../hadoop-mapreduce/*:/usr/lib/hadoop/../hadoop-yarn/*:/etc/hive/conf:/usr/lib/hadoop/../hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
          "spark.driver.extraClassPath": "/home/hadoop/lib/avro-1.8.0.jar:/usr/lib/hadoop/*:/usr/lib/hadoop/../hadoop-hdfs/*:/usr/lib/hadoop/../hadoop-mapreduce/*:/usr/lib/hadoop/../hadoop-yarn/*:/etc/hive/conf:/usr/lib/hadoop/../hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
        },
        "configurations": []
      }
    ]
    

As you can see, I had to add my new library WITH the existing libraries already on the classpath; it didn't work otherwise.

user3487888