
I am trying to read data from AWS S3 into a Spark Dataset in Java; the code runs in IntelliJ. I have added the AWS keys in the SparkSession config as shown in the code below, but I am still getting the following error.

I don't see a HadoopConfiguration equivalent in the Java API [https://spark.apache.org/docs/latest/api/java/index.html], so I set everything on the SparkSession instead. Please correct me here if I am wrong.

Caused by: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint

Here is the code:

SparkSession spark = SparkSession
        .builder()
        .master("local")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .config("spark.hadoop.fs.s3a.awsAccessKeyId", AWS_KEY)
        .config("spark.hadoop.fs.s3a.awsSecretAccessKey", AWS_SECRET_KEY)
        .getOrCreate();

JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
//System.out.println(System.class.path);

Dataset<Row> dF = spark.read().load("s3a://bucket/abc.parquet");
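For comparison, the Hadoop S3A documentation uses `fs.s3a.access.key` and `fs.s3a.secret.key` as the credential property names rather than `fs.s3a.awsAccessKeyId` / `fs.s3a.awsSecretAccessKey`, so the same builder with those names would look roughly like this (an untested sketch, not something I have confirmed against this setup):

SparkSession spark = SparkSession
        .builder()
        .master("local")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        // Property names as documented for the S3A connector; the spark.hadoop. prefix
        // forwards them into the Hadoop Configuration.
        .config("spark.hadoop.fs.s3a.access.key", AWS_KEY)
        .config("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_KEY)
        .getOrCreate();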

Here is my pom.xml, where I have added all the Spark and AWS dependencies. Not sure what else to add now.

<dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.3.2</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
            <version>1.11.417</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.1.1</version>
        </dependency>
    </dependencies>
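One thing I am not sure about (an assumption on my part, not something verified here): hadoop-aws 3.x is built against the shaded `aws-java-sdk-bundle` rather than the plain `aws-java-sdk`, so a dependency closer to what hadoop-aws 3.1.1 expects might look like the sketch below, with the version taken from whatever that hadoop-aws release declares in its own POM:

<!-- Hypothetical alternative to aws-java-sdk: the shaded bundle that hadoop-aws 3.x
     is compiled against. 1.11.271 is assumed; check the hadoop-aws 3.1.1 POM. -->
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-bundle</artifactId>
    <version>1.11.271</version>
</dependency>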
  • The approach is exactly the same as before - access `SparkSession`, get Hadoop Configuration, set config there. – zero323 Sep 27 '18 at 09:47
  • @user6910411 What is the java equivalent of HadoopConfiguration? – Atihska Sep 27 '18 at 18:00
  • @user6910411 None of the solutions worked posted in the other question. This isn't a duplicate I feel. – Atihska Sep 27 '18 at 18:36
  • Found that you have to add AWS creds in the SparkContext only and not the SparkSession. https://stackoverflow.com/questions/52544293/how-does-java-find-spark-hadoop-and-aws-jars-in-intellij/52545423#52545423 – Atihska Sep 27 '18 at 21:31
  • You aren't using the correct key names for the s3a secret and key values. So they don't get picked up & you don't get access. Please: read our documentation https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A – stevel Oct 01 '18 at 12:50
  • Add Hadoop configurations into your spark session as mentioned at following link. https://stackoverflow.com/questions/52544293/how-does-java-find-spark-hadoop-and-aws-jars-in-intellij/52545423#52545423 – Mukund Tripathi Feb 09 '20 at 23:11
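Putting the comments together, a minimal sketch of the SparkContext/Hadoop-configuration route in Java might look like the following (untested; the property names are the ones from the Hadoop S3A documentation linked above, and `AWS_KEY` / `AWS_SECRET_KEY` are the same placeholders as in the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().master("local").getOrCreate();

// Java equivalent of sc.hadoopConfiguration: go through the underlying SparkContext.
Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
hadoopConf.set("fs.s3a.access.key", AWS_KEY);
hadoopConf.set("fs.s3a.secret.key", AWS_SECRET_KEY);

Dataset<Row> df = spark.read().parquet("s3a://bucket/abc.parquet");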
