
I have a very strange error when trying to read a parquet file from S3. I am using the following code snippet from a Spark book.

package com.knx.rtb.sample

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

// One method for defining the schema of an RDD is to make a case class with the desired column
// names and types.
case class Record(key: Int, value: String)

object SparkSql {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkSql")
    val sc = new SparkContext(sparkConf)
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "accesskey")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "secretKey+JJbat7uEQtX/")

    val sqlContext = new SQLContext(sc)

    // Importing the SQL context gives access to all the SQL functions and implicit conversions.
    import sqlContext.implicits._

    val df = sc.parallelize((1 to 100).map(i => Record(i, s"val_$i"))).toDF()

    // If I remove this line, I get the error below
    df.write.parquet("s3n://adx-test/hdfs/pair.parquet")

    // Read in the parquet file.  Parquet files are self-describing, so the schema is preserved.
    val parquetFile = sqlContext.read.parquet("s3n://adx-test/hdfs/pair.parquet")

    // Queries can be run using the DSL on Parquet files just like on the original RDD.
    parquetFile.where($"key" === 1).select($"value".as("a")).collect().foreach(println)

    // These files can also be registered as tables.
    parquetFile.registerTempTable("parquetFile")
    println("Result of Parquet file:")
    sqlContext.sql("SELECT * FROM parquetFile").collect().foreach(println)

    sc.stop()
  }
}

The code snippet runs without any problem. However, whenever I remove the line df.write.parquet("s3n://adx-test/hdfs/pair.parquet"), i.e. read the parquet file from S3 into a Spark DataFrame without writing a parquet file first, I get an error:

Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

It's pretty weird because I have already set the hadoopConfiguration access key ID and secret key at the top of the code snippet. I also tried embedding the credentials in the URL, in the form s3n://accessId:secret@bucket/path, but that does not seem to work when the secret contains the / character.
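One workaround that is often suggested for this symptom (untested here) is to pass the credentials through SparkConf with the spark.hadoop. prefix, so that Spark copies them into every Hadoop Configuration it creates rather than only into sc.hadoopConfiguration. A minimal sketch, with placeholder keys and the same bucket path:

// Sketch (untested): supply the S3 credentials via SparkConf with the
// "spark.hadoop." prefix so Spark propagates them to every Hadoop
// Configuration it creates, not only to sc.hadoopConfiguration.
// "accessKey" and "secretKey" are placeholders.
val sparkConf = new SparkConf()
  .setAppName("SparkSql")
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", "accessKey")
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "secretKey")

val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)

// With the keys carried in SparkConf, the read should no longer depend on a prior write.
val parquetFile = sqlContext.read.parquet("s3n://adx-test/hdfs/pair.parquet")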

  • This is related to [this question](https://stackoverflow.com/questions/24924808/how-to-specify-aws-access-key-id-and-secret-access-key-as-part-of-a-amazon-s3n-u). – dsalaj Aug 02 '20 at 17:36

1 Answer


After upgrading to Spark 1.5, the problem is solved.
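For what it's worth, on newer Spark/Hadoop builds the s3a connector is generally used instead of s3n. A minimal sketch, assuming the hadoop-aws jar (and its AWS SDK dependency) is on the classpath; the keys are placeholders:

// Sketch using the s3a connector (assumes hadoop-aws and its AWS SDK are on the classpath).
// fs.s3a.access.key / fs.s3a.secret.key are the standard s3a credential properties.
sc.hadoopConfiguration.set("fs.s3a.access.key", "accessKey")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "secretKey")

val parquetFile = sqlContext.read.parquet("s3a://adx-test/hdfs/pair.parquet")
parquetFile.show()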
