
I have Spark 1.6 deployed on EMR 4.4.0, and I am connecting to DataStax Cassandra 2.2.5 deployed on EC2.

Saving data into Cassandra works using spark-cassandra-connector 1.4.2-s_2.10 (since it bundles Guava 14). However, reading data from Cassandra fails with the 1.4.2 version of the connector.
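For reference, a minimal sketch of the read path that fails (keyspace and table names here are hypothetical placeholders, not from my actual job):

```shell
# Hypothetical repro: the write path works, but this read path fails
# on connector 1.4.2. Keyspace/table names are placeholders.
spark-shell --packages datastax:spark-cassandra-connector:1.4.2-s_2.10 \
  --conf spark.cassandra.connection.host=10.236.250.96 <<'EOF'
import com.datastax.spark.connector._
val rdd = sc.cassandraTable("my_keyspace", "my_table")
println(rdd.count())  // the failure surfaces when the read is executed
EOF
```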

The version compatibility matrix suggests using connector 1.5.x with Spark 1.6, so I started using 1.5.0. First I hit the Guava problem again, and I fixed it using the userClassPathFirst workaround:

spark-shell --conf spark.yarn.executor.memoryOverhead=2048 \
  --packages datastax:spark-cassandra-connector:1.5.0-s_2.10 \
  --conf spark.cassandra.connection.host=10.236.250.96 \
  --conf spark.executor.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* \
  --conf spark.driver.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true

Now I get past the Guava error, but because I am using userClassPathFirst I am facing another conflict, and I cannot find a way to resolve it:

Lost task 2.1 in stage 2.0 (TID 6, ip-10-187-78-197.ec2.internal): java.lang.LinkageError: 
loader constraint violation: loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading for a different type with name "org/slf4j/Logger"

I run into the same trouble when I repeat the steps using Java code instead of spark-shell. Is there any way to get past this, or a cleaner approach altogether?

Thanks!

lazywiz
  • I'm actually having the exact same issue. – LiMuBei Apr 06 '16 at 15:15
  • We found the root cause: it was a conflicting dependency. When I removed all dependencies from the package and kept just the absolutely necessary ones, the problem disappeared. I was not able to pinpoint the exact package causing the conflict, but it was definitely the root cause. We re-architected the code to split it into two packages: one with all the business logic, and another very lean one just to run the Spark jobs. – lazywiz Apr 07 '16 at 00:36
  • 1
    Well, my only dependency aside from Spark is the Cassandra connector and I still got the error. To me it looks like the issue described here: http://techblog.applift.com/upgrading-spark So basically the two class loaders colliding for some reason. – LiMuBei Apr 07 '16 at 07:16
  • 1
    Turns out I was actually running into this issue: https://issues.apache.org/jira/browse/SPARK-10910 – LiMuBei Apr 07 '16 at 09:22
  • hi @lazywiz - could you describe how you got Spark on EMR working with Titan/Cassandra? I'm absolutely stuck here and don't even know how to begin - most docs seem to imply that Cassandra and Spark must exist on the same cluster. If you have any config files/scripts that you can share - that would be really awesome. thanks! – Sandeep Jul 31 '16 at 10:03

1 Answer


I got the same error when using the userClassPathFirst flag.

Remove these two flags from the configuration and just use the extraClassPath parameter.

Detailed answer here: https://stackoverflow.com/a/40235289/3487888
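For completeness, a sketch of what the invocation from the question would look like with this fix applied (same classpaths as in the question, with only the two userClassPathFirst flags dropped):

```shell
# Same command as in the question, minus the two userClassPathFirst flags
spark-shell --conf spark.yarn.executor.memoryOverhead=2048 \
  --packages datastax:spark-cassandra-connector:1.5.0-s_2.10 \
  --conf spark.cassandra.connection.host=10.236.250.96 \
  --conf spark.executor.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* \
  --conf spark.driver.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
```

Because Guava 16 is listed first on both extraClassPath entries, it shadows the older Guava shipped with EMR without flipping the classloader order, which is what triggered the slf4j LinkageError.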

user3487888