I am using Apache Spark and I have to parse files from Amazon S3. How can I determine the file extension while fetching the files from an Amazon S3 path?
1 Answer
I suggest following the Cloudera tutorial Accessing Data Stored in Amazon S3 through Spark.
To access data stored in Amazon S3 from Spark applications, you could use Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs of the form s3a://bucket_name/path/to/file.txt. You can read and write Spark SQL DataFrames using the Data Source API.
Regarding the file extension, there are a few solutions.
You could simply take the extension from the filename (e.g. file.txt).
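As a minimal sketch of the filename approach, assuming the paths are s3a:// URLs like the one above, the extension can be split off with the Python standard library. The helper name `s3_extension` and the sample URLs other than the first are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def s3_extension(url: str) -> str:
    """Return the lower-cased extension of an S3 URL's key, without the dot."""
    key = urlparse(url).path          # e.g. "/path/to/file.txt"
    _, ext = posixpath.splitext(key)  # e.g. ".txt", or "" if there is none
    return ext.lstrip(".").lower()


if __name__ == "__main__":
    # Hypothetical sample paths, for illustration only.
    for url in [
        "s3a://bucket_name/path/to/file.txt",
        "s3a://bucket_name/data/events.JSON",
        "s3a://bucket_name/data/no_extension",
    ]:
        print(url, "->", repr(s3_extension(url)))
```

Note that only the last suffix is returned, so a key ending in .tar.gz would yield "gz"; whether that is what you want depends on how your files are named.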
If the extensions were removed from the files stored in your S3 buckets, you can still determine the content type by looking at the metadata stored with each S3 object.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
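The metadata lookup could be sketched like this, assuming a boto3-style S3 client whose `head_object` call returns a dict containing a `ContentType` entry. The helper name `s3_content_type` is hypothetical, and the value is only as reliable as whatever was set when the object was uploaded.

```python
def s3_content_type(client, bucket: str, key: str) -> str:
    """Return the Content-Type stored with an S3 object, using a HEAD request.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    The object body is not downloaded; only its headers are fetched.
    """
    response = client.head_object(Bucket=bucket, Key=key)
    # Fall back to an empty string if the header is missing from the response.
    return response.get("ContentType", "")


# Example call (needs boto3 and AWS credentials; bucket and key are placeholders):
#   import boto3
#   print(s3_content_type(boto3.client("s3"), "bucket_name", "path/to/file"))
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub, without touching AWS.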

-
Thank you for your answer. One more question: how would I know the file extension (json, csv, txt), i.e. which type of files I am getting from S3? – Vpn_talent Apr 27 '17 at 10:23
-
Why are you looking for the extension? Don't you have the extension names at the end of your S3 files? – freedev Apr 27 '17 at 11:05
-
Thank you for your guidance. I got the answer as you wrote about finding the extension. – Vpn_talent Apr 27 '17 at 11:13