pyspark read multiple csv files at once

Question

I'm using Spark to read files in HDFS. In our scenario, we receive files in chunks from a legacy system, in CSV format:

ID1_FILENAMEA_1.csv
ID1_FILENAMEA_2.csv
ID1_FILENAMEA_3.csv
ID1_FILENAMEA_4.csv
ID2_FILENAMEA_1.csv
ID2_FILENAMEA_2.csv
ID2_FILENAMEA_3.csv

These files are loaded into the FILENAMEA table in Hive using the Hive Warehouse Connector, with a few transformations such as adding default values. We have around 70 such tables. The Hive tables are created in ORC format and are partitioned on ID. Right now I'm processing all these files one by one, which takes a long time.

I want to make this process much faster. The files will be gigabytes in size.

Is there any way to read all the FILENAMEA files at the same time and load them into the Hive table?

You mentioned 70 tables: do all those CSV files have the same schema? Are all the files in the same directory? If yes, do you have to read all the files in this directory or only some of them? Can you post the code you're currently using (only the read and write part, not the transformations)? Thanks in advance! – Vincent Doba, Sep 27 '21 at 20:24
@VincentDoba: Thanks for replying. Each table has a unique schema. Yes, all the files are in the same directory, and I have to read all of them. I'm using spark.read.csv(filename).toDF(columns). – Raja, Sep 28 '21 at 06:26

Vincent Doba · Accepted Answer · 2021-09-28T07:08:48.547

22

There are two ways to read several CSV files in PySpark. If all the CSV files are in the same directory and all have the same schema, you can read them at once by passing the path of the directory as the argument, as follows:

spark.read.csv('hdfs://path/to/directory')

If you have CSV files in different locations, or CSV files in the same directory alongside other CSV/text files, you can pass them as a single string of comma-separated paths to the .csv() method, as follows:

spark.read.csv('hdfs://path/to/filename1,hdfs://path/to/filename2')

You can find more information about how to read a CSV file with Spark here

If you need to build this list of paths from the list of files in an HDFS directory, you can look at this answer. Once you've created your list of paths, you can turn it into a string to pass to the .csv() method with ','.join(your_file_list)
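
The filtering step can be sketched in plain Python. This is only an illustration under assumptions: it takes the file names as already listed from the HDFS directory (by whatever means you use), assumes they follow the ID_TABLE_N.csv naming from the question, and uses a hypothetical helper name, `group_paths_by_table`, and a placeholder base directory.

```python
import re
from collections import defaultdict

# Assumed naming scheme, taken from the question: <ID>_<TABLE>_<chunk>.csv
FILE_PATTERN = re.compile(r"^(?P<id>[^_]+)_(?P<table>.+)_(?P<chunk>\d+)\.csv$")


def group_paths_by_table(file_names, base_dir):
    """Map each table name to the sorted list of full paths of its chunk files."""
    grouped = defaultdict(list)
    for name in file_names:
        match = FILE_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        grouped[match.group("table")].append(f"{base_dir.rstrip('/')}/{name}")
    return {table: sorted(paths) for table, paths in grouped.items()}


# Hypothetical directory listing; in practice this comes from HDFS.
file_names = [
    "ID1_FILENAMEA_1.csv",
    "ID1_FILENAMEA_2.csv",
    "ID2_FILENAMEA_1.csv",
    "ID1_FILENAMEB_1.csv",
    "notes.txt",
]
paths_by_table = group_paths_by_table(file_names, "hdfs://path/to/directory")
filenamea_paths = paths_by_table["FILENAMEA"]

# The comma-separated string form described in this answer.
joined = ",".join(filenamea_paths)
```

Grouping by table name gives one list of paths per Hive table, so each of the tables can be read with a single call instead of one call per chunk file.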

edited Sep 28 '21 at 07:08

answered Sep 28 '21 at 07:02


Thanks. Is there any way to read all the CSV files with a wildcard, like spark.read.csv('hdfs://path/to/*FILENAMEA*')? – Raja Sep 28 '21 at 07:22
No, it will not work: `spark.read` takes only literal file paths. You first have to list and filter your files outside Spark in plain Python, then pass the result as a string argument to `spark.read.csv()` – Vincent Doba Sep 28 '21 at 07:29
Each file will be gigabytes in size. Performance-wise, is this a good way to proceed? – Raja Sep 28 '21 at 07:35
Yes, performance-wise it will be good to proceed. Spark is designed for this kind of use case. – Vincent Doba Sep 28 '21 at 08:32
Each table will have 500 files. Is there any limit on the number of files that can be handled? – Raja Sep 29 '21 at 15:40
I don't think there is any limit, but you should test this – Vincent Doba Sep 29 '21 at 15:41
Thanks. Will try that and let you know.. – Raja Sep 29 '21 at 16:22
this does not work – Sauron May 23 '22 at 16:42
In Spark 2.0 this does not work. For Spark 2.0 you need to do this: `spark.read.csv(['hdfs://path/to/filename1','hdfs://path/to/filename2'])` – Pyaive Oleg Aug 30 '22 at 11:48

score 0 · Answer 2 · answered Jan 31 '23 at 04:24

Using spark.read.csv(["path1", "path2", "path3", ...]) you can read multiple files from different paths. This means you first have to build a list of the paths: a list, not a string of comma-separated file paths.
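
A minimal sketch of the list form, wrapped in a small helper. The comments on the accepted answer report that the comma-separated string fails in Spark 2.0, which is why a list is passed here. The helper name `read_csv_chunks` and the column names are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def read_csv_chunks(spark, paths, columns):
    """Read all chunk files into one DataFrame and rename its columns."""
    # Pass a Python list of paths, not a single comma-separated string.
    df = spark.read.csv(list(paths))
    # toDF takes the new column names as separate arguments.
    return df.toDF(*columns)


# Usage, assuming an existing SparkSession named `spark` and
# hypothetical column names:
# df = read_csv_chunks(
#     spark,
#     ["hdfs://path/to/filename1", "hdfs://path/to/filename2"],
#     ["id", "value"],
# )
```

All files in one call must share a schema, so with a unique schema per table this would be called once per table, each time with that table's list of paths and column names.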

AEChris

