
Is there a way to calculate the size in bytes of an Apache Spark DataFrame using PySpark?

Mihai Tache
  • What exactly do you expect to learn from this? – zero323 Jul 04 '16 at 08:40
  • Possible duplicate of http://stackoverflow.com/questions/35008123/how-to-find-spark-rdd-dataframe-size – Himaprasoon Jul 04 '16 at 08:44
  • I'm trying to limit the number of output files when exporting the data frame by repartitioning it based on its size. – Mihai Tache Jul 04 '16 at 14:48
  • Here's a possible workaround. You can easily find out how many rows you're dealing with using `df.count()`, then use `df.write.option("maxRecordsPerFile", 10000).save("file/path/")` to cap how many records go into each output file and so control how many files you get. It also saves you a very costly `repartition`. Would this help? – Omar Apr 05 '19 at 17:31 (see the sketch just below the comments)
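
A minimal sketch of the workaround in the last comment, assuming `df` stands in for the real DataFrame and that the output path and the 10,000-record cap are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100000)                   # stand-in for the real DataFrame

    n_rows = df.count()                        # how many rows we're dealing with
    (df.write
        .option("maxRecordsPerFile", 10000)    # at most 10,000 records per output file
        .mode("overwrite")
        .parquet("/tmp/output"))               # placeholder output path

Unlike `repartition(n)`, this caps the records per file without forcing a shuffle; the exact number of files still depends on how the rows are spread across partitions.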

1 Answer


Why don't you just cache the DataFrame, then look in the Spark UI under the Storage tab and convert the reported size to bytes?

df.cache()  # follow with an action such as df.count() so the cache is actually materialized
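
If you'd rather read the cached size programmatically instead of from the UI, one option is to go through the JVM-side storage info. Note that `_jsc` and `getRDDStorageInfo()` are internal/developer APIs rather than part of the public PySpark surface, so treat this as a rough sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000)   # stand-in for the real DataFrame

    df.cache()
    df.count()   # an action is required so the cache is actually materialized

    # Each entry describes one cached RDD/DataFrame and its storage footprint.
    for info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
        print(info.name(), info.memSize(), "bytes in memory,",
              info.diskSize(), "bytes on disk")

Keep in mind that the cached (deserialized, in-memory) size is not the same as the serialized size on disk, so treat either number as an estimate when deciding how to repartition.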
thePurplePython