
I'm trying to export data from an EMR master node to an S3 bucket, but it's failing. When I execute the following code in my PySpark job:

(DF1
    .coalesce(1)
    .write
    .format("csv")
    .option("header", "true")
    .save("s3://fittech-bucket/emr/outputs/test_data"))

the following error occurs:

An error occurred while calling o78.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
        at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:452)
        at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:548)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:278)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)

1 Answer


Try writing directly to your local HDFS file system and then using aws s3 cp to copy the files to S3. Alternatively, you could enable EMRFS and use its sync command so that local changes are pushed to S3 automatically; see the EMRFS CLI reference: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-cli-reference.html. It's a workaround, but it should solve your primary issue, and EMRFS brings a number of benefits on top of that.

If you want to execute the EMRFS sync command from within Python (I'm not sure boto3 exposes a way to do it), you can run it as a bash command from Python, as described here: Running Bash commands in Python.
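A minimal sketch of the first approach, reusing the DataFrame and bucket from the question; the HDFS and local paths here are examples, not something from your setup:

import subprocess

# Write to HDFS on the cluster instead of S3 (example path).
(DF1
    .coalesce(1)
    .write
    .format("csv")
    .option("header", "true")
    .save("hdfs:///tmp/test_data"))

# `aws s3 cp` only reads the local file system, so first pull the
# output directory out of HDFS, then copy it to S3 with the AWS CLI.
subprocess.run(["hdfs", "dfs", "-get", "/tmp/test_data", "/tmp/test_data"], check=True)
subprocess.run(
    ["aws", "s3", "cp", "/tmp/test_data",
     "s3://fittech-bucket/emr/outputs/test_data", "--recursive"],
    check=True,
)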

If you just want to push the file(s) to S3 with boto3, uploading files is documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-creating-buckets.html
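For example, a minimal boto3 sketch; the local path and part-file name are illustrative (Spark generates its own part-... file names):

import boto3

# Upload one local file to the question's bucket.
s3 = boto3.client("s3")
s3.upload_file(
    "/tmp/test_data/part-00000.csv",         # local file written by Spark (example name)
    "fittech-bucket",                        # target bucket
    "emr/outputs/test_data/part-00000.csv",  # target key
)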

You can also use s3-dist-cp or hadoop fs to copy to/from S3, as mentioned here: How does EMR handle an s3 bucket for input and output?
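Roughly like this, again shelling out from Python (paths are examples):

import subprocess

# hadoop fs can read HDFS and write to S3 directly, no local copy needed.
subprocess.run(
    ["hadoop", "fs", "-cp", "hdfs:///tmp/test_data",
     "s3://fittech-bucket/emr/outputs/test_data"],
    check=True,
)

# s3-dist-cp performs the same copy as a distributed job, which scales
# better for large outputs.
subprocess.run(
    ["s3-dist-cp", "--src", "hdfs:///tmp/test_data",
     "--dest", "s3://fittech-bucket/emr/outputs/test_data"],
    check=True,
)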

devinbost