
I would like to create a DataFrame from many small files located in the same directory. I plan to use read.csv from pyspark.sql. I've learned that in RDD world, textFile function is designed for reading small number of large files, whereas wholeTextFiles function is designed for reading a large number of small files (e.g. see this thread). Does read.csv use textFile or wholeTextFiles under the hood?

dgp
1 Answer


Yes, that's possible; just pass the path of the parent directory:

df = spark.read.csv('path until the parent directory where the files are located')

And you should get all the files read into one DataFrame. If the files don't all have the same number of CSV fields per line, the number of columns comes from the file with the maximum number of fields in a line.

Ramesh Maharjan
  • Thanks for a quick reply. I know `read.csv` can read many files. The question is about internals of `read.csv`. – dgp Mar 21 '18 at 19:03