Parsing files from Amazon S3 with Apache Spark

Question

I am using Apache Spark and need to parse files from Amazon S3. How can I determine a file's extension while fetching files from an Amazon S3 path?

freedev · Accepted Answer · 2017-04-27T10:29:33.383

2

I suggest following the Cloudera tutorial Accessing Data Stored in Amazon S3 through Spark.

To access data stored in Amazon S3 from Spark applications, you could use Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs of the form s3a://bucket_name/path/to/file.txt.

You can read and write Spark SQL DataFrames using the Data Source API.

Regarding the file extension, there are a few solutions. You could simply take the extension from the filename (e.g. txt from file.txt).
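Taking the extension from the filename can be sketched as below. This is a minimal sketch, not a definitive implementation: the helper names extension_of and load_by_extension and the FORMAT_BY_EXTENSION mapping are hypothetical, and load_by_extension assumes you pass in an existing PySpark SparkSession whose cluster is already configured for s3a access.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a Spark data source name.
FORMAT_BY_EXTENSION = {"json": "json", "csv": "csv", "txt": "text"}


def extension_of(s3_url):
    """Return the lowercase extension of the object key, without the dot.

    Returns an empty string when the key has no extension. Keys that
    contain '?' or '#' would need extra handling, since urlparse treats
    those characters as query and fragment separators.
    """
    key = urlparse(s3_url).path
    return posixpath.splitext(key)[1].lstrip(".").lower()


def load_by_extension(spark, s3_url):
    """Read s3_url with the DataFrame reader that matches its extension.

    `spark` is assumed to be an existing SparkSession.
    """
    fmt = FORMAT_BY_EXTENSION.get(extension_of(s3_url))
    if fmt is None:
        raise ValueError(f"unsupported or missing extension: {s3_url}")
    return spark.read.format(fmt).load(s3_url)
```

For example, extension_of("s3a://bucket_name/path/to/file.txt") yields "txt", so load_by_extension would read that path with the text data source.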

If the extensions have been removed from the files stored in your S3 buckets, you can still determine the content type by looking at the metadata stored with each S3 object, which a HEAD request on the object returns:

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
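Reading the content type from the object metadata might look like the sketch below. It assumes the boto3 library, which the answer does not mention; the helper name content_type_of and the bucket and key in the usage comment are hypothetical. The value is only whatever was recorded at upload time, so objects uploaded without an explicit type typically carry a generic value such as binary/octet-stream.

```python
def content_type_of(s3_client, bucket, key):
    """Return the Content-Type recorded for an S3 object, or None.

    Issues a HEAD request, so the object body is not downloaded.
    `s3_client` is assumed to be a boto3 S3 client.
    """
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response.get("ContentType")


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   content_type_of(boto3.client("s3"), "bucket_name", "path/to/file")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub client in tests.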

edited Apr 27 '17 at 10:29

answered Apr 27 '17 at 10:20


Thank you for your answer. One more question: how would I know the file extension (json, csv, txt), i.e. which type of files I am getting from S3? – Vpn_talent Apr 27 '17 at 10:23
Why are you looking for the extension? Don't you have the extension names at the end of your S3 files? – freedev Apr 27 '17 at 11:05
Thank you for your guidance. I got the answer from what you wrote about finding the extension. – Vpn_talent Apr 27 '17 at 11:13
