I'm using the most generic S3 read code in Spark; it reads all the files in my specified directory into a single DataFrame:

val df = spark.read.option("sep", "\t")
  .option("inferSchema", "true")
  .option("encoding","UTF-8")
  .schema(sch)
  .csv("s3://my-bucket/my-directory/")

What would be the best way (if any) to get the number of files that were read from this path?


1 Answer

You can try counting the distinct values of input_file_name():

import org.apache.spark.sql.functions.input_file_name

val nbFiles = df.select(input_file_name()).distinct.count
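
If a rough count is enough, DataFrame.inputFiles (a standard Spark API) returns a best-effort snapshot of the files backing the DataFrame, so its length gives much the same number. A minimal sketch, reusing the df from the question:

// inputFiles is a best-effort, deduplicated list of the files this DataFrame reads from
val nbInputFiles = df.inputFiles.length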

Or using Hadoop FileSystem:

import org.apache.hadoop.fs.Path

val s3Path = new Path("s3://my-bucket/my-directory/")
// Resolve the FileSystem for this path and summarize everything under it
val contentSummary = s3Path.getFileSystem(sc.hadoopConfiguration).getContentSummary(s3Path)

val nbFiles = contentSummary.getFileCount()
blackbishop
  • Thanks. As an add-on, what should I do (in Spark) to get the total file size (in bytes) of all the files read from my directory? In other words, the total size of my DataFrame? – Debapratim Chakraborty Mar 15 '21 at 13:14
  • @DebapratimChakraborty check out this [answer](https://stackoverflow.com/a/35008549/1386551) for the size. – blackbishop Mar 15 '21 at 14:42
  • Thanks. This solves my purpose, but I just want to ask: when I read a CSV into a DataFrame, does Spark keep any metadata about it, like its size, number of rows, etc.? – Debapratim Chakraborty Mar 15 '21 at 14:48
  • Going to advise against getContentSummary as it's a woefully inefficient single-threaded tree walk. Avoid using it against any object store. – stevel Mar 15 '21 at 21:16
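
For the total-size follow-up in the comments, and given the advice above to avoid getContentSummary on object stores, one alternative is to list the files with the Hadoop FileSystem API and sum their lengths. A minimal sketch, assuming the same bucket/prefix as the question; note it reports the on-disk (possibly compressed) size of the input files, not the in-memory size of the DataFrame:

import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}

val s3Path = new Path("s3://my-bucket/my-directory/")
val fs: FileSystem = s3Path.getFileSystem(spark.sparkContext.hadoopConfiguration)

// listFiles(path, recursive = true) returns a RemoteIterator[LocatedFileStatus]
val it = fs.listFiles(s3Path, true)
var totalBytes = 0L
var fileCount = 0L
while (it.hasNext) {
  val status: LocatedFileStatus = it.next()
  totalBytes += status.getLen   // size of this file in bytes
  fileCount += 1
}

println(s"$fileCount files, $totalBytes bytes under $s3Path")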