I would like to create a DataFrame from many small files located in the same directory. I plan to use `read.csv` from `pyspark.sql`. I've learned that in the RDD world, the `textFile` function is designed for reading a small number of large files, whereas the `wholeTextFiles` function is designed for reading a large number of small files (e.g. see this thread). Does `read.csv` use `textFile` or `wholeTextFiles` under the hood?
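For context, here is a minimal sketch of the two RDD APIs being compared; the directory path is a made-up placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-demo").getOrCreate()
sc = spark.sparkContext

# textFile: one record per line; input is split along block boundaries,
# which suits a small number of large files.
lines = sc.textFile("/data/many_small_csvs")  # placeholder path

# wholeTextFiles: one (path, content) pair per file; each file is
# read in full, which suits a large number of small files.
pairs = sc.wholeTextFiles("/data/many_small_csvs")  # placeholder path
```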

dgp
- Best way to know is by reading the code. Best part is *Spark is open-source*! – ernest_k Mar 21 '18 at 19:53
- I tried to read the source code of `read.csv`; so far I cannot find an answer there. – dgp Mar 21 '18 at 20:33
1 Answer
Yes, that's possible; just give the path to the parent directory:

df = spark.read.csv('path/to/parent/directory')

and all the files should be read into one DataFrame. If the files don't have the same number of fields per line, the number of columns is taken from the file with the maximum number of fields in a line.
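For illustration, a self-contained version of that call could look like the following; the directory path and reader options are assumptions, not part of the answer:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-many-csvs").getOrCreate()

# Point read.csv at the parent directory and Spark reads every CSV file
# under it into a single DataFrame.
df = spark.read.csv(
    "/data/many_small_csvs",  # hypothetical parent directory
    header=True,              # treat each file's first line as a header (assumption)
    inferSchema=True,         # sample the files to infer column types
)

df.printSchema()
df.show(5)
```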

Ramesh Maharjan
- Thanks for a quick reply. I know `read.csv` can read many files. The question is about the internals of `read.csv`. – dgp Mar 21 '18 at 19:03