14

How to find the size (in MB) of a dataframe in pyspark

df = spark.read.json("/Filestore/tables/test.json")

I want to find the size of df, or of the underlying test.json file.

Aravindh
  • 141
  • 1
  • 1
  • 6
  • 2
    Do these answer your question? [How to estimate dataframe real size in pyspark?](https://stackoverflow.com/questions/37077432/how-to-estimate-dataframe-real-size-in-pyspark), https://stackoverflow.com/questions/46228138/how-to-find-pyspark-dataframe-memory-usage, https://stackoverflow.com/questions/39652767/pyspark-2-0-the-size-or-shape-of-a-dataframe – mazaneicha Jun 16 '20 at 15:19
  • Does this answer your question? [How to find the size or shape of a DataFrame in PySpark?](https://stackoverflow.com/questions/39652767/how-to-find-the-size-or-shape-of-a-dataframe-in-pyspark) – mazaneicha Nov 08 '22 at 14:23

3 Answers

10

Late answer, but since Google brought me here first, I figure I'll add this answer based on the comment by user @hiryu here.

This is tested and working for me. It requires caching, so it is probably best kept to notebook development.

# Need to cache the table (and force the cache to happen)
df.cache()
df.count() # force caching

# need to access hidden parameters from the `SparkSession` and `DataFrame`
catalyst_plan = df._jdf.queryExecution().logical()
size_bytes = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()

# always try to remember to free cached data once finished
df.unpersist()

print("Total table size: ", convert_size_bytes(size_bytes))

You need to access the hidden `_jdf` and `_jsparkSession` attributes. Since the Python objects do not expose them directly, they won't be suggested by IntelliSense.
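
As the comment below points out, on newer Spark versions the `executePlan` call above can fail because its signature changed. A variant that reads the statistics straight from the DataFrame's query execution should work there; treat it as a sketch (assuming Spark 3.x), not a drop-in replacement:

# sketch for newer Spark versions: read the statistics directly from the
# optimized plan of the cached DataFrame
df.cache()
df.count()  # force caching

stats = df._jdf.queryExecution().optimizedPlan().stats()
size_bytes = int(stats.sizeInBytes().toString())  # sizeInBytes() comes back as a JVM BigInt

df.unpersist()

print("Total table size: ", convert_size_bytes(size_bytes))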

Bonus:

My convert_size_bytes function looks like:

def convert_size_bytes(size_bytes):
    """
    Converts a size in bytes to a human-readable string using 1024-based units.
    """
    import math
    import sys

    if not isinstance(size_bytes, int):
        # fallback for non-int inputs: this measures the Python wrapper
        # object itself, not the data it refers to
        size_bytes = sys.getsizeof(size_bytes)

    if size_bytes == 0:
        return "0B"

    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])
L Co
  • 805
  • 11
  • 17
  • I got the error: `py4j.Py4JException: Method executePlan([class org.apache.spark.sql.catalyst.plans.logical.Filter]) does not exist`. I suggest using `df.cache(); nrows = df.count()  # force caching` and then `size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)` – nenetto Jul 19 '23 at 10:24
1

In general this is not easy. You can

  • use org.apache.spark.util.SizeEstimator
  • use an approach which involves caching, see e.g. https://stackoverflow.com/a/49529028/1138523
  • use df.inputFiles() and another API to get the file sizes directly (I did so using the Hadoop FileSystem API, see How to get file size). Note that this only works if the dataframe was not filtered/aggregated. A sketch of this approach follows this list.
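
A rough PySpark sketch of the last bullet, assuming PySpark 3.1+ (where DataFrame.inputFiles() is available) and files reachable through the Hadoop FileSystem API:

# sum the on-disk size of the files backing the dataframe
hadoop_conf = spark._jsc.hadoopConfiguration()
jvm_path = spark._jvm.org.apache.hadoop.fs.Path

total_bytes = 0
for file_uri in df.inputFiles():
    p = jvm_path(file_uri)
    fs = p.getFileSystem(hadoop_conf)            # filesystem that owns this path
    total_bytes += fs.getFileStatus(p).getLen()  # file length in bytes

print("Input files size: %.2f MB" % (total_bytes / (1024 * 1024)))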
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
  • 3
    This is Scala. The question asks about Python (PySpark). – earth2jason Mar 06 '23 at 18:45
  • would love to know if there is an equivalent of this method for pyspark, or find a pointer to where it is in the scala source code so we can see what it's doing. – szeitlin Aug 24 '23 at 21:44
1

My running version:

# Need to cache the table (and force the cache to happen)
df.cache()
nrows = df.count() # force caching
    
# need to access hidden parameters from the `SparkSession` and `DataFrame`
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
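
If sc is not already defined (for example outside a classic notebook), spark.sparkContext is the usual way to get it, and since the estimate comes back as a plain integer it can be converted to MB directly; a small sketch:

sc = spark.sparkContext
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
print("Estimated size: %.2f MB" % (size_bytes / (1024 * 1024)))

Keep in mind (as the comment below suggests) that SizeEstimator.estimate(df._jdf) measures the JVM Dataset object itself rather than the distributed data, so the number may not reflect the actual data size.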
nenetto
  • 354
  • 2
  • 6
  • 1
    this works, thank you... except the value I got back makes no sense. I guess the estimator can be wrong... – szeitlin Aug 24 '23 at 21:45