63

Suppose I have a defined schema for loading 10 CSV files in a folder. Is there a way to automatically load the tables using Spark SQL? I know this can be done by using an individual DataFrame for each file [given below], but can it be automated with a single command? Rather than pointing to a file, can I point to a folder?

df = sqlContext.read \
       .format("com.databricks.spark.csv") \
       .option("header", "true") \
       .load("../Downloads/2008.csv")
ZygD
Chendur

6 Answers

113

Use a wildcard, e.g. replace `2008` with `*`:

df = sqlContext.read \
       .format("com.databricks.spark.csv") \
       .option("header", "true") \
       .load("../Downloads/*.csv")  # <-- note the star (*)

Spark 2.0

# these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
spark.read.option("header", "true").csv("../Downloads/*.csv")

Notes:

  1. Replace `format("com.databricks.spark.csv")` with `format("csv")`, or use the `csv` method directly; the `com.databricks.spark.csv` format has been integrated into Spark 2.0.

  2. Use `spark` (the Spark 2.0 `SparkSession` entry point), not `sqlContext` -- see the minimal sketch below.
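
A minimal end-to-end pyspark sketch tying these notes together, assuming Spark 2.x (the folder path comes from the question; everything else is standard API):

    from pyspark.sql import SparkSession

    # SparkSession replaces sqlContext as the entry point in Spark 2.x
    spark = SparkSession.builder.appName("load-csv-folder").getOrCreate()

    # one DataFrame from every CSV file in the folder
    df = spark.read.option("header", "true").csv("../Downloads/*.csv")
    df.printSchema()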

Jacek Laskowski
Yaron
  • The first solution seems to only load the first csv in the folder. Do you know how to load them all? – mdornfe1 Nov 20 '16 at 22:55
  • @mdornfe1 - It should work. We need to understand the details (Spark version, exact command used, CSV files, etc.) of the case where it didn't work for you. If it is short, add a comment with the details; otherwise open a new question. – Yaron Nov 21 '16 at 06:54
  • Is there a way to load files from different depths of the file structure? Say, `.../Downloads/first.csv` as well as `.../Downloads/subfolder/second.csv`? Using `.csv(".../Downloads/*/*.csv")` doesn't do that. – NotYanka Mar 27 '17 at 08:46
  • @NotYanka you might want to check out the other answer -- `load` will accept a list of paths as strings, and each of them may contain a wildcard. – mattsilver Apr 22 '17 at 14:50
  • @ohruunuruus I tried the other answer but it doesn't accept a list of paths. Any other suggestions? – Abu Shoeb Oct 31 '17 at 06:12
  • @Yaron, what if the column names are not exactly the same between CSVs? How will the schema be inferred? Is there a way to enforce a union of the column names to be used in the schema? I do not use inferSchema; it simply reads everything in as string type... – leonard Apr 03 '18 at 18:12
  • Works just fine with `sqlContext.read.load` in v2.3 (pyspark); what is the reason for using `spark` instead of `sqlContext`? – muon Apr 11 '18 at 22:05
  • When dealing with files that will have the same columns but in different order, does Spark know how to map the data from different files to the correct columns? – Frank Pinto Jan 20 '20 at 19:55
27

Ex1:

Reading a single CSV file. Provide complete file path:

val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\cars1.csv")

Ex2:

Reading multiple CSV files by passing file names:

val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\cars1.csv", "C:\\spark\\sample_data\\tmp\\cars2.csv")

Ex3:

Reading multiple CSV files by passing a list of names:

val paths = List("C:\\spark\\sample_data\\tmp\\cars1.csv", "C:\\spark\\sample_data\\tmp\\cars2.csv")
val df = spark.read.option("header", "true").csv(paths: _*)

Ex4:

Reading all CSV files in a folder, ignoring other files:

val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\*.csv")

Ex5:

Reading multiple CSV files from multiple folders:

val folders = List("C:\\spark\\sample_data\\tmp", "C:\\spark\\sample_data\\tmp1")
val df = spark.read.option("header", "true").csv(folders: _*)
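
For pyspark users (see the comments on the accepted answer about list support), the equivalent of Ex3/Ex5 is to pass a plain Python list -- a hedged sketch reusing the same hypothetical paths:

    # pyspark: DataFrameReader.csv accepts a list of paths directly
    paths = ["C:\\spark\\sample_data\\tmp\\cars1.csv",
             "C:\\spark\\sample_data\\tmp\\cars2.csv"]
    df = spark.read.option("header", "true").csv(paths)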
Unheilig
mputha
23

Note that you can use other tricks, such as:

-- One or more wildcards:
       .../Downloads20*/*.csv
-- Brackets (character ranges) and braces (alternatives):
       .../Downloads201[1-5]/book.csv
       .../Downloads201{11,15,19,99}/book.csv
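
These patterns come from Hadoop path globbing, so they behave the same from pyspark -- a short sketch with hypothetical folder names:

    # character range: matches Downloads2011 .. Downloads2015
    df_range = spark.read.option("header", "true").csv("../Downloads201[1-5]/book.csv")
    # alternatives: matches only the listed suffixes
    df_set = spark.read.option("header", "true").csv("../Downloads20{11,15}/book.csv")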
Jamal Jam
  • Nice! I wish this was documented. But could not find it at https://spark.apache.org/docs/latest/api/python/index.html. – flow2k Jul 03 '19 at 08:57
19

Reader's Digest: (Spark 2.x)

For example, if you have 3 directories holding CSV files:

dir1, dir2, dir3

You then define `paths` as a comma-delimited string of paths, as follows:

paths = "dir1/,dir2/,dir3/*"

Then use the following function, passing it this `paths` variable:

def get_df_from_csv_paths(paths):
    df = spark.read.format("csv").option("header", "false") \
        .schema(custom_schema) \
        .option('delimiter', '\t') \
        .option('mode', 'DROPMALFORMED') \
        .load(paths.split(','))
    return df

By then running:

df = get_df_from_csv_paths(paths)

You will obtain in `df` a single Spark DataFrame containing the data from all the CSVs found in these 3 directories.
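
Note that the function assumes a pre-built custom_schema; a minimal sketch of how one might be defined (field names and types are hypothetical):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # hypothetical schema for the tab-delimited files loaded above
    custom_schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("value", StringType(), True),
    ])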

===========================================================================

Full Version:

If you want to ingest multiple CSVs from multiple directories, you simply need to pass a list and use wildcards.

For example, if your `data_path` looks like this:

's3://bucket_name/subbucket_name/2016-09-*/184/*,
s3://bucket_name/subbucket_name/2016-10-*/184/*,
s3://bucket_name/subbucket_name/2016-11-*/184/*,
s3://bucket_name/subbucket_name/2016-12-*/184/*, ... '

you can use the above function to ingest all the CSVs in all these directories and subdirectories at once.

This would ingest every directory under `s3://bucket_name/subbucket_name/` that matches the wildcard patterns specified. E.g. the first pattern would look in

bucket_name/subbucket_name/

for all directories with names starting with

2016-09-

and for each of those take only the directory named

184

and within that subdirectory look for all csv files.

And this would be executed for each of the patterns in the comma delimited list.

This works far better than unioning separate DataFrames.
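
If you'd rather not hand-write the comma-delimited string, a hypothetical helper along these lines builds it from the month prefixes in the example (the helper itself is an assumption, not part of the original answer):

    # build "s3://.../2016-09-*/184/*,s3://.../2016-10-*/184/*,..." programmatically
    months = ["2016-09", "2016-10", "2016-11", "2016-12"]
    paths = ",".join(
        "s3://bucket_name/subbucket_name/{}-*/184/*".format(m) for m in months
    )
    df = get_df_from_csv_paths(paths)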

eiTan LaVi
  • This is the answer I was looking for, but it is more complicated than it needs to be. It could benefit from a simple example in which a list of paths is passed. Generating a list from a string of comma-separated paths is convenient, but a bit outside the scope of the question. This and other extras (e.g. the feature-rich function) caused me to initially overlook this answer. – mattsilver Apr 22 '17 at 14:53
  • This is the answer I have been looking for too. This takes care of getting data from multiple buckets/directories/formats etc. – Vivek Jul 19 '17 at 06:47
  • Agree with @ohruunuruus - good answer, but folks here are generally fluent enough in Python to know how to construct lists. – flow2k Jul 03 '19 at 08:54
5

Using Spark 2.0+, we can load multiple CSV files from different directories using `df = spark.read.csv(['directory_1', 'directory_2', 'directory_3', ...], header=True)`. For more information, refer to the `DataFrameReader.csv` documentation.
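
A self-contained sketch of that call (directory names are hypothetical, and inferSchema is optional):

    # pyspark, Spark 2.0+
    df = spark.read.csv(
        ["data/2016/", "data/2017/", "data/2018/"],  # hypothetical directories
        header=True,
        inferSchema=True,  # or pass schema=... for stable column types
    )
    df.printSchema()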

Neeleshkumar S
2
val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\*.csv")

This will consider every matching `.csv` file directly under `sample_data` (use `C:\\spark\\sample_data\\*\\*.csv` to reach into the `tmp`, `tmp1`, `tmp2`, ... subfolders).

fcdt