
How can I load a bunch of files from an S3 bucket into a single PySpark dataframe? I'm running on an EMR instance. If the files are local, I can use the SparkContext textFile method. But when the files are on S3, how can I use boto3 to load multiple files of various types (CSV, JSON, ...) into a single dataframe for processing?

Phantômaxx
Paul Bendevis

1 Answer


Spark natively reads from S3 using Hadoop APIs, not Boto3. Also, textFile creates an RDD, not a DataFrame. Finally, don't try to load two different formats into a single dataframe, as you won't be able to parse them consistently.

I would suggest using

csvDf = spark.read.csv("s3a://path/to/files/*.csv")
jsonDf = spark.read.json("s3a://path/to/files/*.json")

And from there, you can filter and join the dataframes using SparkSQL.

Note: spark.read.json expects JSON Lines format, where each line of the file is a single, self-contained JSON object.
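To illustrate the expected layout with plain Python (the field names here are made up):

```python
import json

# JSON Lines: one complete object per line -- the layout spark.read.json expects
good = '{"id": 1, "name": "alice"}\n{"id": 2, "name": "bob"}'

# Every line parses independently
records = [json.loads(line) for line in good.splitlines()]
print(records)
```

A pretty-printed JSON array spanning multiple lines will not parse this way; newer Spark versions offer a `multiLine` read option for such files, but one object per line is the default expectation.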

OneCricketeer
  • You mean I shouldn't combine multiple file types into a single dataframe? What is the right way to combine data from multiple sources/types in the PySpark framework? – Paul Bendevis May 28 '18 at 16:59
  • 2
    I'm getting a No FileSystem for scheme: s3 error. And I also tried s3n and s3a with similar errors. – Paul Bendevis May 28 '18 at 17:01
  • 1
    To load multiple types, you need different parsers, as shown... You join or union the dataframes to get all your data together in a unified format. Regarding the errors, it should already work on EMR, but see https://stackoverflow.com/a/33787125/2308683 – OneCricketeer May 28 '18 at 17:45
  • Turns out my conda installation on the EMR is using a different pyspark than what was installed. – Paul Bendevis May 28 '18 at 18:22
  • Yeah, if your cluster already has Spark, you don't need Conda to manage it – OneCricketeer May 28 '18 at 18:24
  • I'm using Conda for managing versions of installed libraries for ML. I just had to export the PYSPARK_PYTHON variable to use the conda python. – Paul Bendevis May 29 '18 at 12:19
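A sketch of that environment fix (the interpreter path below is a placeholder; adjust it to your actual conda environment):

```shell
# Point PySpark at the conda interpreter instead of the system Python
# (path is an assumption -- substitute your own conda environment's python)
export PYSPARK_PYTHON="$HOME/miniconda3/bin/python"
```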