
I am trying to create a table in Redshift from a Spark dataset. I am using the spark-redshift driver over JDBC to achieve this locally. Here is the code snippet:

data.write()
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://..")              // Redshift JDBC URL
    .option("dbtable", "test_table")                  // target table
    .option("tempdir", "s3://temp")                   // S3 staging area for the COPY
    .option("aws_iam_role", "arn:aws:iam::..")        // IAM role used by the COPY
    .option("extracopyoptions", "region 'us-west-1'")
    .mode(SaveMode.Append)
    .save();

My Maven pom.xml has the following dependency:

<dependency>
   <groupId>com.databricks</groupId>
   <artifactId>spark-redshift_2.11</artifactId>
   <version>2.0.1</version>
</dependency>

I am using Java 1.8, and I am getting the following error:

java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at com.databricks.spark.redshift.Utils$.assertThatFileSystemIsNotS3BlockFileSystem(Utils.scala:156)
    at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:340)
    at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
    at com.peak.spark.jobs.SparkDataIngestJob.writeData(SparkDataIngestJob.java:196)
    at com.peak.spark.jobs.SparkDataIngestJob.exec(SparkDataIngestJob.java:123)
    at com.peak.spark.core.AbstractSparkJob.run(AbstractSparkJob.java:74)
    at com.peak.spark.core.SparkAppLauncher.onApplicationEvent(SparkAppLauncher.java:40)
    at com.peak.spark.core.SparkAppLauncher.onApplicationEvent(SparkAppLauncher.java:16)
    at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:151)
    at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:128)
    at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:331)
    at org.springframework.context.support.AbstractApplicationContext.start(AbstractApplicationContext.java:1174)
    at com.peak.spark.core.SparkApp.launch(SparkApp.java:38)
    at com.peak.spark.core.SparkApp.main(SparkApp.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" java.lang.RuntimeException: Spark Job FailedNo FileSystem for scheme: s3
    at com.peak.spark.jobs.SparkDataIngestJob.exec(SparkDataIngestJob.java:162)
    at com.peak.spark.core.AbstractSparkJob.run(AbstractSparkJob.java:74)
    at com.peak.spark.core.SparkAppLauncher.onApplicationEvent(SparkAppLauncher.java:40)
    at com.peak.spark.core.SparkAppLauncher.onApplicationEvent(SparkAppLauncher.java:16)
    at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:151)
    at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:128)
    at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:331)
    at org.springframework.context.support.AbstractApplicationContext.start(AbstractApplicationContext.java:1174)
    at com.peak.spark.core.SparkApp.launch(SparkApp.java:38)
    at com.peak.spark.core.SparkApp.main(SparkApp.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Please help me figure out what's wrong here.

Ritika
  • When you say "achieve this locally", do you mean you are trying to do this on your laptop/desktop? – Shekhar Jan 25 '19 at 09:13
  • @Shekhar: Yes, I am running it on my local system. – Ritika Jan 25 '19 at 09:25
  • OK, that explains the error. Have you configured AWS credentials on your system? Are you able to connect to AWS S3 using the AWS SDK or the AWS CLI? – Shekhar Jan 25 '19 at 09:26

2 Answers


Since you are trying to execute this code on your local system, your code won't know how to access the S3 file system.

You can do one of two things to solve this problem:

  1. Configure AWS credentials on your system so that your code can reach the S3 bucket. I would not recommend this approach, for various reasons.
  2. Keep your file path in a configuration file. Use two configuration files: one for testing and one for the production environment. In the test environment, use paths like c:\path\to\your\dummy\folder\, and in the production environment configuration file use paths like s3://your_bucket_name/path/in/bucket (a sketch follows this list).
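
A minimal sketch of option 2, assuming a plain java.util.Properties file per environment; the file names (test.properties, prod.properties), the env system property, and the tempdir key are all hypothetical:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Pick the config file by environment, e.g. -Denv=prod (defaults to test).
String env = System.getProperty("env", "test");
Properties props = new Properties();
try (InputStream in = Files.newInputStream(Paths.get(env + ".properties"))) {
    props.load(in);
}
// test.properties:  tempdir=c:/path/to/your/dummy/folder/
// prod.properties:  tempdir=s3://your_bucket_name/path/in/bucket
String tempDir = props.getProperty("tempdir");

// The write itself stays the same; only the staging path changes per environment.
data.write()
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://..")
    .option("dbtable", "test_table")
    .option("tempdir", tempDir)
    .mode(SaveMode.Append)
    .save();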

Hope it helps.

Shekhar
  • I have configured AWS credentials on my local system. Moreover, I am providing the IAM role to the spark-redshift driver as well, as mentioned in the code snippet. – Ritika Jan 25 '19 at 10:00
  • OK. Can you execute any S3-related command from your command prompt? Try something like listing the files in your bucket: `aws s3 ls s3://temp/` – Shekhar Jan 25 '19 at 13:32

I suppose you forgot to include the hadoop-aws package in your project. This package allows you to work with the s3:// scheme:

<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.6.0</version>
</dependency>
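
Once hadoop-aws (and a matching aws-java-sdk) is on the classpath, you can wire up the S3A filesystem through Spark's Hadoop configuration. A minimal sketch, assuming credentials are supplied via the standard AWS environment variables; the fs.s3a.* keys are standard Hadoop S3A settings and the app name is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("redshift-ingest")   // hypothetical app name
        .master("local[*]")
        .getOrCreate();

// Map the s3a:// scheme to the S3A filesystem shipped in hadoop-aws.
Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
hadoopConf.set("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));
hadoopConf.set("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"));

// Then use the s3a scheme for the staging directory:
// .option("tempdir", "s3a://temp")

Note that hadoop-aws must match the aws-java-sdk version it was built against (for example, hadoop-aws 2.7.x pairs with aws-java-sdk 1.7.4); a mismatch typically surfaces as a NoSuchMethodError at runtime.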

Duy Nguyen
  • 1. Try adding the package when you run spark-submit: `spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3`. 2. Moreover, please try changing `s3://` to `s3a://` if #1 does not help. – Duy Nguyen Jan 25 '19 at 10:30
  • Added the package in spark-submit and changed s3:// to s3a://. I get this error: `Exception in thread "main" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)` – Ritika Jan 25 '19 at 11:06
  • I had a similar issue. Please find the correct version of hadoop-aws; I think aws-java-sdk 1.7.4 is the version compatible with hadoop-aws. Please try using that version once. – james.bondu Jan 25 '19 at 17:18