How can I load a bunch of files from an S3 bucket into a single PySpark DataFrame? I'm running on an EMR instance. If the files are local, I can use the SparkContext textFile method. But when the files are on S3, how can I use boto3 to load multiple files of various types (CSV, JSON, ...) into a single DataFrame for processing?
1 Answer
Spark natively reads from S3 using Hadoop APIs, not Boto3. And textFile is for reading into an RDD, not a DataFrame. Also, don't try to load two different formats into a single DataFrame, as you won't be able to parse them consistently.
I would suggest using
csvDf = spark.read.csv("s3a://path/to/files/*.csv")
jsonDf = spark.read.json("s3a://path/to/files/*.json")
And from there, you can filter, join, or union the DataFrames using Spark SQL.
Note: the JSON files need to contain one JSON object per line (JSON Lines format).
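For example, a minimal sketch of combining the two DataFrames read above ("id" and "value" are hypothetical column names for illustration; adjust to your actual schema):

# Stack rows from both sources after selecting a shared set of columns
# ("id" and "value" are assumed column names, not from the original post)
combined = csvDf.select("id", "value").union(jsonDf.select("id", "value"))

# Or, if the two sources describe the same entities, join on a common key instead
joined = csvDf.join(jsonDf, on="id", how="inner")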

OneCricketeer
- You mean I shouldn't combine multiple file types into a single dataframe? What is the right way to combine data from multiple sources/types in the PySpark framework? – Paul Bendevis May 28 '18 at 16:59
- I'm getting a No FileSystem for scheme: s3 error. And I also tried s3n and s3a with similar errors. – Paul Bendevis May 28 '18 at 17:01
- To load multiple types, you need different parsers, as shown... You join or union the dataframes to get all your data together in a unified format. Regarding the errors, it should already work on EMR, but see https://stackoverflow.com/a/33787125/2308683 – OneCricketeer May 28 '18 at 17:45
- Turns out my conda installation on the EMR is using a different pyspark than what was installed. – Paul Bendevis May 28 '18 at 18:22
- Yeah, if your cluster already has Spark, you don't need Conda to manage it – OneCricketeer May 28 '18 at 18:24
- I'm using Conda for managing versions of installed libraries for ML. I just had to export the PYSPARK_PYTHON variable to use the conda python. – Paul Bendevis May 29 '18 at 12:19