14

How to find the size (in MB) of a dataframe in pyspark

df = spark.read.json("/Filestore/tables/test.json")

I want to find the size of df, or of the underlying test.json file.

Aravindh
  • 141
  • 1
  • 1
  • 6
  • 2
    Do these answer your question? [How to estimate dataframe real size in pyspark?](https://stackoverflow.com/questions/37077432/how-to-estimate-dataframe-real-size-in-pyspark), https://stackoverflow.com/questions/46228138/how-to-find-pyspark-dataframe-memory-usage, https://stackoverflow.com/questions/39652767/pyspark-2-0-the-size-or-shape-of-a-dataframe – mazaneicha Jun 16 '20 at 15:19
  • Does this answer your question? [How to find the size or shape of a DataFrame in PySpark?](https://stackoverflow.com/questions/39652767/how-to-find-the-size-or-shape-of-a-dataframe-in-pyspark) – mazaneicha Nov 08 '22 at 14:23

3 Answers

10

Late answer, but since Google brought me here first, I figure I'll add this answer based on the comment by user @hiryu here.

This is tested and working for me. It requires caching, so it is probably best kept to notebook development.

# Need to cache the table (and force the cache to happen)
df.cache()
df.count() # force caching

# need to access hidden parameters from the `SparkSession` and `DataFrame`
catalyst_plan = df._jdf.queryExecution().logical()
size_bytes = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()

# always try to remember to free cached data once finished
df.unpersist()

print("Total table size: ", convert_size_bytes(size_bytes))

You need to access the hidden `_jdf` and `_jsparkSession` attributes. Since the Python objects do not expose them directly, they won't be suggested by IntelliSense.
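
As the comment below points out, on newer Spark versions the `executePlan` call above can fail because its signature changed. A variant that reads the statistics straight from the DataFrame's query execution should work there; treat it as a sketch (assuming Spark 3.x), not a drop-in replacement:

# sketch for newer Spark versions: read the statistics directly from the
# optimized plan of the cached DataFrame
df.cache()
df.count()  # force caching

stats = df._jdf.queryExecution().optimizedPlan().stats()
size_bytes = int(stats.sizeInBytes().toString())  # sizeInBytes() comes back as a JVM BigInt

df.unpersist()

print("Total table size: ", convert_size_bytes(size_bytes))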

Bonus:

My convert_size_bytes function looks like:

def convert_size_bytes(size_bytes):
    """
    Converts a size in bytes to a human-readable string using 1024-based units.
    """
    import math
    import sys

    if not isinstance(size_bytes, int):
        # fallback for non-int inputs: this measures the Python wrapper
        # object itself, not the data it refers to
        size_bytes = sys.getsizeof(size_bytes)

    if size_bytes == 0:
        return "0B"

    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])
L Co
  • 805
  • 11
  • 17
  • I got the error: `py4j.Py4JException: Method executePlan([class org.apache.spark.sql.catalyst.plans.logical.Filter]) does not exist`. I suggest using `df.cache(); nrows = df.count()  # force caching` and then `size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)` – nenetto Jul 19 '23 at 10:24
1

In general this is not easy. You can

  • use org.apache.spark.util.SizeEstimator
  • use an approach which involves caching, see e.g. https://stackoverflow.com/a/49529028/1138523
  • use df.inputFiles() and another API to get the file sizes directly (I did so using the Hadoop FileSystem API, see How to get file size). Note that this only works if the dataframe was not filtered/aggregated. A sketch of this approach follows this list.
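
A rough PySpark sketch of the last bullet, assuming PySpark 3.1+ (where DataFrame.inputFiles() is available) and files reachable through the Hadoop FileSystem API:

# sum the on-disk size of the files backing the dataframe
hadoop_conf = spark._jsc.hadoopConfiguration()
jvm_path = spark._jvm.org.apache.hadoop.fs.Path

total_bytes = 0
for file_uri in df.inputFiles():
    p = jvm_path(file_uri)
    fs = p.getFileSystem(hadoop_conf)            # filesystem that owns this path
    total_bytes += fs.getFileStatus(p).getLen()  # file length in bytes

print("Input files size: %.2f MB" % (total_bytes / (1024 * 1024)))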
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
  • 3
    This is Scala. The question asks about Python (PySpark). – earth2jason Mar 06 '23 at 18:45
  • would love to know if there is an equivalent of this method for pyspark, or find a pointer to where it is in the scala source code so we can see what it's doing. – szeitlin Aug 24 '23 at 21:44
1

My running version:

# Need to cache the table (and force the cache to happen)
df.cache()
nrows = df.count() # force caching
    
# need to access hidden parameters from the `SparkSession` and `DataFrame`
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
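
If sc is not already defined (for example outside a classic notebook), spark.sparkContext is the usual way to get it, and since the estimate comes back as a plain integer it can be converted to MB directly; a small sketch:

sc = spark.sparkContext
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
print("Estimated size: %.2f MB" % (size_bytes / (1024 * 1024)))

Keep in mind (as the comment below suggests) that SizeEstimator.estimate(df._jdf) measures the JVM Dataset object itself rather than the distributed data, so the number may not reflect the actual data size.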
nenetto
  • 354
  • 2
  • 6
  • 1
    this works, thank you... except the value I got back makes no sense. I guess the estimator can be wrong... – szeitlin Aug 24 '23 at 21:45