
I am trying to read data from AWS S3 into a Dataset/RDD in Java, but I am getting Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities. I am running the Spark code in Java on IntelliJ, so I added the Hadoop dependencies in pom.xml as well.

Below are my code and my pom.xml file.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkJava {

    public static void main(String[] args){

        SparkSession spark  = SparkSession
                .builder()
                .master("local")
                .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")                  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
                .config("fs.s3n.awsAccessKeyId", AWS_KEY)
                .config("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
                .getOrCreate();

        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
        String input_path = "s3a://bucket/2018/07/28/zqa.parquet";
        Dataset<Row> dF = spark.read().load(input_path); // THIS LINE CAUSES ERROR

    }
}

Here are the dependencies from pom.xml

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.1.1</version>
    </dependency>
</dependencies>

Any help will be really appreciated. Thanks in advance!

Atihska
  • Have you set the relevant environment variables, i.e. HADOOP_HOME, SPARK_DIST_CLASSPATH, etc.? If yes, could you please share what you've set them to? (I suppose there are quite a few that we'll need to set before doing this.) – Lalit Sep 26 '18 at 22:35
  • @Lalit: Thanks for your reply. How do you set your classpath? Does specifying the dependencies not suffice? When you say HADOOP_HOME, do I have to install Hadoop? Could you please give me step-by-step instructions? Sorry, I am new to Java, coming from the Python world. – Atihska Sep 26 '18 at 22:39
  • Just the dependencies are not going to be enough (the joys of Hadoop, I've experienced a lot :)). Try the response with the most votes in this post and see if it works - https://stackoverflow.com/questions/30906412/noclassdeffounderror-com-apache-hadoop-fs-fsdatainputstream-when-execute-spark-s. It might get a little confusing, so don't worry (we can go into the explanation later) - just respond on this with whatever the result is. – Lalit Sep 27 '18 at 00:34

1 Answer


Solved this by adding the following dependency:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.1</version>
</dependency>
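
For context, org.apache.hadoop.fs.StreamCapabilities lives in hadoop-common, which is why the missing class error goes away once that artifact is on the classpath. As a rough sketch of how the Hadoop dependencies could be kept on a single release (the hadoop.version property below is just an illustration, not something from the original pom.xml):

<properties>
    <!-- illustrative property; keeps all Hadoop artifacts on one release -->
    <hadoop.version>3.1.1</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>

Mixing Hadoop artifacts from different versions is a common cause of NoClassDefFoundError and NoSuchMethodError with the S3A connector, so keeping them aligned tends to avoid running into the next missing class.
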
Atihska