I'm using the most generic S3 read code in Spark; it reads all the files in my specified directory into a single DataFrame:

val df = spark.read.option("sep", "\t")
  .option("inferSchema", "true")
  .option("encoding","UTF-8")
  .schema(sch)
  .csv("s3://my-bucket/my-directory/")

What would be the best way (if any) to get the number of files that were read from this path?


1 Answer

You can try counting the distinct values of input_file_name():

import org.apache.spark.sql.functions.input_file_name

val nbFiles = df.select(input_file_name()).distinct.count
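
If a rough count is enough, DataFrame.inputFiles (a standard Spark API) returns a best-effort snapshot of the files backing the DataFrame, so its length gives much the same number. A minimal sketch, reusing the df from the question:

// inputFiles is a best-effort, deduplicated list of the files this DataFrame reads from
val nbInputFiles = df.inputFiles.length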

Or using Hadoop FileSystem:

import org.apache.hadoop.fs.Path

val s3Path = new Path("s3://my-bucket/my-directory/")
// Resolve the FileSystem for this path and summarize everything under it
val contentSummary = s3Path.getFileSystem(sc.hadoopConfiguration).getContentSummary(s3Path)

val nbFiles = contentSummary.getFileCount()
blackbishop
  • Thanks. As an add-on, what should I do (in Spark) to get the total file size (in bytes) of all the files read from my directory? In other words, the total size of my DataFrame? – Debapratim Chakraborty Mar 15 '21 at 13:14
  • @DebapratimChakraborty check out this [answer](https://stackoverflow.com/a/35008549/1386551) for the size. – blackbishop Mar 15 '21 at 14:42
  • Thanks. This solves my purpose, but I just want to ask: when I read a CSV into a DataFrame, does Spark keep any metadata about it, like its size, number of rows, etc.? – Debapratim Chakraborty Mar 15 '21 at 14:48
  • Going to advise against getContentSummary as it's a woefully inefficient single-threaded tree walk. Avoid using it against any object store. – stevel Mar 15 '21 at 21:16
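
For the total-size follow-up in the comments, and given the advice above to avoid getContentSummary on object stores, one alternative is to list the files with the Hadoop FileSystem API and sum their lengths. A minimal sketch, assuming the same bucket/prefix as the question; note it reports the on-disk (possibly compressed) size of the input files, not the in-memory size of the DataFrame:

import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}

val s3Path = new Path("s3://my-bucket/my-directory/")
val fs: FileSystem = s3Path.getFileSystem(spark.sparkContext.hadoopConfiguration)

// listFiles(path, recursive = true) returns a RemoteIterator[LocatedFileStatus]
val it = fs.listFiles(s3Path, true)
var totalBytes = 0L
var fileCount = 0L
while (it.hasNext) {
  val status: LocatedFileStatus = it.next()
  totalBytes += status.getLen   // size of this file in bytes
  fileCount += 1
}

println(s"$fileCount files, $totalBytes bytes under $s3Path")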