
Is there a way to calculate the size in bytes of an Apache Spark DataFrame using PySpark?

Mihai Tache
  • What exactly do you expect to learn from this? – zero323 Jul 04 '16 at 08:40
  • Possible duplicate of http://stackoverflow.com/questions/35008123/how-to-find-spark-rdd-dataframe-size – Himaprasoon Jul 04 '16 at 08:44
  • I'm trying to limit the number of output files when exporting the data frame by repartitioning it based on its size. – Mihai Tache Jul 04 '16 at 14:48
  • Here's a possible workaround. You can easily find out how many rows you're dealing with using `df.count()`, then use `df.write.option("maxRecordsPerFile", 10000).save("file/path/")` to cap how many records go into each output file and so control how many files you get. It also saves you a very costly `repartition`. Would this help? – Omar Apr 05 '19 at 17:31 (see the sketch just below the comments)
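
A minimal sketch of the workaround in the last comment, assuming `df` stands in for the real DataFrame and that the output path and the 10,000-record cap are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100000)                   # stand-in for the real DataFrame

    n_rows = df.count()                        # how many rows we're dealing with
    (df.write
        .option("maxRecordsPerFile", 10000)    # at most 10,000 records per output file
        .mode("overwrite")
        .parquet("/tmp/output"))               # placeholder output path

Unlike `repartition(n)`, this caps the records per file without forcing a shuffle; the exact number of files still depends on how the rows are spread across partitions.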

1 Answer


Why don't you just cache the DataFrame, then look in the Spark UI under the Storage tab and convert the reported size to bytes?

df.cache()  # follow with an action such as df.count() so the cache is actually materialized
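
If you'd rather read the cached size programmatically instead of from the UI, one option is to go through the JVM-side storage info. Note that `_jsc` and `getRDDStorageInfo()` are internal/developer APIs rather than part of the public PySpark surface, so treat this as a rough sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000)   # stand-in for the real DataFrame

    df.cache()
    df.count()   # an action is required so the cache is actually materialized

    # Each entry describes one cached RDD/DataFrame and its storage footprint.
    for info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
        print(info.name(), info.memSize(), "bytes in memory,",
              info.diskSize(), "bytes on disk")

Keep in mind that the cached (deserialized, in-memory) size is not the same as the serialized size on disk, so treat either number as an estimate when deciding how to repartition.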
thePurplePython