How can I load a bunch of files from an S3 bucket into a single PySpark DataFrame? I'm running on an EMR instance. If the files are local, I can use the SparkContext textFile method. But when the files are on S3, how can I use boto3 to load multiple files of various types (CSV, JSON, ...) into a single DataFrame for processing?
1 Answer
Spark natively reads from S3 using Hadoop APIs, not Boto3. And textFile is for reading into an RDD, not a DataFrame. Also, don't try to load two different formats into a single DataFrame, as you won't be able to parse them consistently.
I would suggest using
csvDf = spark.read.csv("s3a://path/to/files/*.csv")
jsonDf = spark.read.json("s3a://path/to/files/*.json")
And from there, you can filter, join, or union the DataFrames using Spark SQL.
Note: the JSON files need to contain one JSON object per line (JSON Lines format).
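For example, a minimal sketch of combining the two DataFrames read above ("id" and "value" are hypothetical column names for illustration; adjust to your actual schema):

# Stack rows from both sources after selecting a shared set of columns
# ("id" and "value" are assumed column names, not from the original post)
combined = csvDf.select("id", "value").union(jsonDf.select("id", "value"))

# Or, if the two sources describe the same entities, join on a common key instead
joined = csvDf.join(jsonDf, on="id", how="inner")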

OneCricketeer
- You mean I shouldn't combine multiple file types into a single dataframe? What is the right way to combine data from multiple sources/types in the PySpark framework? – Paul Bendevis May 28 '18 at 16:59
- I'm getting a No FileSystem for scheme: s3 error. And I also tried s3n and s3a with similar errors. – Paul Bendevis May 28 '18 at 17:01
- To load multiple types, you need different parsers, as shown... You join or union the dataframes to get all your data together in a unified format. Regarding the errors, it should already work on EMR, but see https://stackoverflow.com/a/33787125/2308683 – OneCricketeer May 28 '18 at 17:45
- Turns out my conda installation on the EMR is using a different pyspark than what was installed. – Paul Bendevis May 28 '18 at 18:22
- Yeah, if your cluster already has Spark, you don't need Conda to manage it – OneCricketeer May 28 '18 at 18:24
- I'm using Conda for managing versions of installed libraries for ML. I just had to export the PYSPARK_PYTHON variable to use the conda python. – Paul Bendevis May 29 '18 at 12:19