Is there a way to calculate the size in bytes of an Apache Spark DataFrame using PySpark?
- What exactly do you expect to learn from this? – zero323 Jul 04 '16 at 08:40
- Possible duplicate of http://stackoverflow.com/questions/35008123/how-to-find-spark-rdd-dataframe-size – Himaprasoon Jul 04 '16 at 08:44
- I'm trying to limit the number of output files when exporting the data frame by repartitioning it based on its size. – Mihai Tache Jul 04 '16 at 14:48
- Here's a possible workaround. You can easily find out how many rows you're dealing with using `df.count()`, then use `df.write.option("maxRecordsPerFile", 10000).save("file/path/")` to get the exact number of output files you want. It also saves you a very costly `repartition`. Would this help? – Omar Apr 05 '19 at 17:31 (a sketch follows below)
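
A minimal sketch of the `maxRecordsPerFile` workaround from the last comment, assuming an existing DataFrame `df`; the 10000 cap and the `"output/path"` destination are placeholders, not values from the original question:

    # Cap how many records Spark writes to each output file, instead of
    # repartitioning by an estimated size.
    num_rows = df.count()  # optional: total row count, to reason about file counts
    (df.write
       .option("maxRecordsPerFile", 10000)  # writer option available in Spark 2.2+
       .mode("overwrite")
       .save("output/path"))

Because the writer simply rolls over to a new file once the cap is reached within each task, this avoids the shuffle that an explicit `repartition` would trigger; the trade-off is that the number of output files is at least the number of write tasks.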
1 Answer
Why don't you just cache the DataFrame, then look in the Spark UI under the Storage tab and convert the units shown there to bytes?

    df.cache()

– thePurplePython
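
A slightly fuller sketch of this cache-and-inspect approach, assuming a local `SparkSession` and a placeholder DataFrame; the programmatic part goes through `_jsc`, a private py4j gateway whose details can change between Spark versions, so treat it as an illustration rather than a stable API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-size").getOrCreate()
    df = spark.range(1000000)  # placeholder DataFrame; substitute your own

    df.cache()   # mark the DataFrame for caching
    df.count()   # run an action so the cache actually materializes

    # Option 1: open the Spark UI (http://<driver-host>:4040 by default) and
    # read the in-memory / on-disk size from the "Storage" tab.

    # Option 2 (internal API): read the same storage figures through the JVM
    # gateway. Each RDDInfo exposes memSize and diskSize in bytes.
    for rdd_info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
        print(rdd_info.name(), rdd_info.memSize(), rdd_info.diskSize())

Keep in mind that the cached size reflects Spark's in-memory representation (which depends on the storage level), so it can differ noticeably from the size of the same data written out as Parquet or CSV.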