How to find size (in MB) of a dataframe in pyspark?
df = spark.read.json("/Filestore/tables/test.json")
I want to find the size of df or of test.json.
Late answer, but since Google brought me here first, I figure I'll add this answer based on the comment by user @hiryu here.
This is tested and working for me. It requires caching, so it is probably best kept to notebook development.
# Need to cache the table (and force the cache to happen)
df.cache()
df.count() # force caching
# need to access hidden parameters from the `SparkSession` and `DataFrame`
catalyst_plan = df._jdf.queryExecution().logical()
size_bytes = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()
# always try to remember to free cached data once finished
df.unpersist()
print("Total table size: ", convert_size_bytes(size_bytes))
You need to access the hidden _jdf and _jsparkSession variables. Since the Python objects do not expose the needed attributes directly, they won't be shown by IntelliSense.
My convert_size_bytes function looks like:
def convert_size_bytes(size_bytes):
    """
    Converts a size in bytes to a human-readable string using 1024-based units.
    """
    import math
    import sys

    if not isinstance(size_bytes, int):
        # Fallback for non-int inputs: approximate by the in-memory size of the object
        size_bytes = sys.getsizeof(size_bytes)

    if size_bytes == 0:
        return "0B"
    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])
In general this is not easy. You can use org.apache.spark.util.SizeEstimator to estimate the size of the dataframe in memory. Or you can call df.inputFiles() and use another API to get the file size directly (I did so using the Hadoop FileSystem API (How to get file size); a sketch of that route follows the code below). Note that this only works if the dataframe was not filtered/aggregated. My running version:
# Need to cache the table (and force the cache to happen)
df.cache()
nrows = df.count() # force caching
# access the JVM SizeEstimator through the SparkContext; df._jdf is the underlying Java DataFrame
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
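For completeness, a minimal sketch of the file-size route mentioned above, using df.inputFiles() (available in recent PySpark versions) together with the Hadoop FileSystem API. It assumes a SparkContext named sc and a dataframe read straight from storage; the variable names are my own:

# List the files backing the dataframe and sum their on-disk sizes.
# Only meaningful if df was not filtered/aggregated after reading.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path

total_bytes = 0
for file_uri in df.inputFiles():
    path = Path(file_uri)
    fs = path.getFileSystem(hadoop_conf)
    total_bytes += fs.getFileStatus(path).getLen()

print("Total input file size (MB): ", total_bytes / (1024 * 1024))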