
I would like to create a DataFrame from many small files located in the same directory. I plan to use read.csv from pyspark.sql. I've learned that in RDD world, textFile function is designed for reading small number of large files, whereas wholeTextFiles function is designed for reading a large number of small files (e.g. see this thread). Does read.csv use textFile or wholeTextFiles under the hood?

dgp
1 Answer


Yes, that's possible; just pass the path of the parent directory:

df = spark.read.csv('path until the parent directory where the files are located')

And you should get all the files read into one DataFrame. If the files don't all have the same number of CSV fields per line, the number of columns comes from the file with the maximum number of fields in a line.

Ramesh Maharjan
  • Thanks for a quick reply. I know `read.csv` can read many files. The question is about internals of `read.csv`. – dgp Mar 21 '18 at 19:03