
How to determine a dataframe size?

Right now I estimate the real size of a dataframe as follows:

headers_size = sum(len(key) for key in df.first().asDict())
rows_size = df.rdd.map(lambda row: sum(len(str(value)) for value in row.asDict().values())).sum()
total_size = headers_size + rows_size

It is too slow and I'm looking for a better way.
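One possible way to make the calculation above cheaper is to run it on a random sample and scale the result up. The sketch below is only an illustration under assumptions: `row_text_size` and `estimate_text_size` are hypothetical helper names, "size" is taken to mean the number of characters in the string form of each value, and the result is an approximation because `sample` returns only roughly the requested fraction of rows.

```python
def row_text_size(row_dict):
    """Characters needed to print every value of one row (a dict of column -> value)."""
    return sum(len(str(value)) for value in row_dict.values())


def estimate_text_size(df, fraction=0.01, seed=42):
    """Estimate the total text size of `df` from a random sample of its rows."""
    # Column names are available on the driver without running a job.
    header_size = sum(len(name) for name in df.columns)
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)
    sampled_size = sampled.rdd.map(lambda row: row_text_size(row.asDict())).sum()
    # Scale the sampled total back up to the full DataFrame (assumption: the
    # sample is representative of the whole).
    return header_size + sampled_size / fraction
```

A call such as `estimate_text_size(df, fraction=0.05)` would then touch about 5% of the rows in Python instead of all of them.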

TheSilence

2 Answers


I currently use the approach below, though I am not sure it is the best way:

from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)
df.count()  # an action is needed so the DataFrame is actually cached

The Storage tab of the Spark web UI then shows the cached size in MB. Afterwards I call unpersist to free the memory:

df.unpersist()
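The same number can probably be read programmatically instead of from the web UI. The following is a minimal sketch, not a definitive implementation: `total_cached_bytes` and `measure_cached_size` are hypothetical helper names, the code reaches the JVM `SparkContext.getRDDStorageInfo()` developer API through PySpark's private `_jsc` handle, and it assumes nothing else is cached or evicted while the measurement runs.

```python
def total_cached_bytes(sc):
    """Sum memory and disk bytes of everything Spark currently reports as cached."""
    # `_jsc` is a private PySpark attribute; relying on it is an assumption.
    infos = sc._jsc.sc().getRDDStorageInfo()
    return sum(info.memSize() + info.diskSize() for info in infos)


def measure_cached_size(df, sc):
    """Cache `df`, force evaluation, and return the growth in cached bytes."""
    before = total_cached_bytes(sc)
    df.persist()  # uses the default storage level
    try:
        df.count()  # an action is required before anything is cached
        return total_cached_bytes(sc) - before
    finally:
        df.unpersist()  # always release the cache, even if count() fails
```

With a session named `spark`, usage would look like `measure_cached_size(df, spark.sparkContext)`. The result describes the cached representation, which can differ from the size of the same data as text or on disk.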
David C.
Kiran Thati

A nice post from Tamas Szuromi describes one approach: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/

from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
def _to_java_object_rdd(rdd):  
    """ Return a JavaRDD of Object by unpickling
    It will convert each Python object into Java object by Pyrolite, whenever the
    RDD is serialized in batch or not.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

JavaObj = _to_java_object_rdd(df.rdd)

nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
Ziggy Eunicien
  • How is this supposed to work? I tested this code and, in my opinion, the results look more like a random function than an estimate. Or did I misinterpret them? I am using Spark 1.6 in CDH 5.11.2. – sdikby Sep 27 '17 at 14:34
  • This always returns the same size for me, no matter the dataframe. It always returns 216 MB. – makansij Dec 14 '17 at 22:29
  • I saw very little change: from 185,704,232 to 186,020,448 to 187,366,176. However, the number of records changed from 5 to 2,000,000 to 1,500,000,000. – Jie Jan 31 '20 at 22:58
  • I use PySpark 2.4.4 and this does not work: TypeError, javaPackage not callable. – SummersKing Jun 08 '20 at 08:43
  • Do not use this. This is not the true memory usage. It reports nearly the same number for a DataFrame of 1B records and another with 10M records. – Tony Aug 06 '20 at 09:47