
Is there any reference for which sets of versions are compatible between the AWS Java SDK, Hadoop, the hadoop-aws bundle, Hive, and Spark?

For example, I know Spark is not compatible with Hive versions above 2.1.1.

tooptoop4

3 Answers


You cannot drop in a later version of the AWS SDK than the one hadoop-aws was built with and expect the s3a connector to work. Ever. That is now written down quite clearly in the S3A troubleshooting docs.

Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.

This may seem frustrating, given the rate at which the AWS team push out a new SDK, but you have to understand that (a) the API often changes incompatibly between versions (as you have seen), and (b) every release introduces/moves bugs which end up causing problems.

Here is the 3.x timeline of things which broke on updates of the AWS SDK.

Every upgrade of the AWS SDK JAR causes a problem somewhere. Sometimes it needs an edit to the code and a recompile; most commonly it shows up as logs filling with false-alarm messages, dependency problems, threading quirks, etc. Things which can take time to surface.

What you get with a Hadoop release is not just the aws-sdk JAR it was compiled against; you get a hadoop-aws JAR which contains the workarounds and fixes for whatever problems that SDK release introduced, identified in the minimum of four weeks of testing before the Hadoop release ships.

Which is why, no, you shouldn't be changing JARs unless you plan to do a complete end-to-end retest of the s3a client code, including load tests. You are encouraged to do that; the Hadoop project always welcomes more testing of our pre-release code, with the Hadoop 3.1 binaries ready to play with. But trying to do it yourself by changing JARs? Sadly, an isolated exercise in pain.
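
A minimal sketch of what that means for a Maven build (the version values below are placeholders, not recommendations): the usual failure mode is an explicitly pinned, newer SDK sitting next to hadoop-aws, and the fix is to declare only the hadoop-* artifacts, all on one version, and take the SDK that release was built and tested against transitively.

<!-- Anti-pattern (illustrative only): pinning a newer AWS SDK alongside
     hadoop-aws. This is the "drop in a later version" case above and will
     only change the stack traces you see. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-bundle</artifactId>
  <version>1.12.x</version> <!-- placeholder: "latest", chosen independently of hadoop-aws -->
</dependency>

<!-- Instead: only hadoop-* dependencies, every one of them on the same
     ${hadoop.version}; the matching aws-sdk JAR arrives transitively. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
</dependency>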

stevel
  • Let's say I want to use Spark 2.3.0 (read/write to S3) and Hive 2.1.1 (external tables reading from S3). There is no clear matrix saying "I can use Hadoop vA, AWS SDK vB, hadoop-aws vC" or "I can use Hadoop vD, AWS SDK vE, hadoop-aws vF"? – tooptoop4 Mar 30 '18 at 02:03
  • On a side note, do you know why https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.5 is 11 MB but https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.11.303 is only 30 KB with no classes? – tooptoop4 Mar 30 '18 at 11:08
  • AWS went from a JAR with everything to "an expanded set of interdependent libraries" a while back. Hadoop 3 has embraced the aws-sdk-bundle, which has everything in one place plus the shaded dependencies (especially Jackson) it needs. 50 MB, but a consistent 50 MB. – stevel Apr 01 '18 at 12:39
  • Regarding versions: the hadoop-* JARs need to be consistent. Then your choice of AWS SDK comes out of the hadoop-aws version: hadoop-common vA => hadoop-aws vA => matching aws-sdk version. The good news: you get to choose what Spark version you use. FWIW, I like the ASF 2.8.x release chain as stable functionality; 2.7 is underperformant against S3. – stevel Apr 01 '18 at 12:41
  • Maybe I'm a bit slow. But how does this answer the question? The OP asks what versions are compatible with one another, and you just say that they should be compatible. – pavel_orekhov Jul 15 '23 at 17:44
  • I was trying to make clear that you have to have the exact same version of hadoop-* everywhere, no matter what they are. If you don't do that, welcome to the world of random stack traces. If you look at the S3A and Spark SO posts, you can see that the "I dropped hadoop-aws-3.3.4 into my spark distro and now I get a stack trace" post comes up about twice a month. People need to know not to mix and match, and then it's over to mavenrepository.com to work out the hadoop-aws version and matching AWS SDK JAR. – stevel Jul 17 '23 at 14:38
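
To make the lookup mentioned in the comments above concrete: each hadoop-aws release publishes a POM that declares the AWS SDK artifact it was built and tested against, so the pairing can be read off mvnrepository.com (or out of a dependency tree) rather than guessed. An illustrative (not verbatim) sketch of the kind of declaration you will find there for the Hadoop 3.x line:

<!-- Illustrative excerpt: the hadoop-aws POM names the shaded SDK bundle it
     was built against; the version is fixed by the hadoop-aws release you pick. -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-bundle</artifactId>
  <version><!-- determined by the hadoop-aws version --></version>
</dependency>

So the practical chain for the original question is: pick the Hadoop version your Spark/Hive stack actually runs, take hadoop-aws at exactly that version, and read the matching SDK version off its dependency list instead of choosing one yourself.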

Seems like this matrix is sort of available here:

https://hadoop.apache.org/docs/r3.3.6/hadoop-aws/dependency-analysis.html

These are the dependencies compatible with hadoop-aws. As you can see, the URL contains "r3.3.6"; you can substitute whichever version you want. I think you can also write "stable" there and it gives you the latest version.

https://hadoop.apache.org/docs/stable/hadoop-aws/dependency-analysis.html

pavel_orekhov

In the Hadoop documentation, it is stated that adding the hadoop-aws JAR to your build dependencies will pull in a compatible aws-sdk JAR.

So, I created a dummy Maven project with these dependencies to download the compatible versions:

<properties>
  <!-- Your exact Hadoop version here -->
  <hadoop.version>3.3.1</hadoop.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>

Then I checked the resolved dependency versions, used them in my project, and it worked.
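
If it helps, that last check can also be scripted rather than done by eye. A sketch (assuming the standard maven-dependency-plugin; the execution id and phase here are arbitrary choices) that prints the AWS artifacts resolved through hadoop-aws during the build:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <!-- pin an explicit plugin version here if your build requires it -->
      <executions>
        <execution>
          <id>show-aws-sdk-version</id>
          <phase>validate</phase>
          <goals>
            <goal>tree</goal>
          </goals>
          <configuration>
            <!-- limit the tree to the com.amazonaws artifacts pulled in transitively -->
            <includes>com.amazonaws</includes>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

The same information is available ad hoc from the command line by running the dependency:tree goal directly.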