I have a DataFrame created by a HiveContext executing a Hive SQL query; in my case the queried data has to be pushed to different datastores. The DataFrame ends up with thousands of partitions because of the SQL I am executing.
To push the data to the datastores I use mapPartitions(), obtaining a connection per partition and writing the records through it.
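For context, my write path looks roughly like this, sketched in Python/PySpark terms. `create_connection` and `push_record` are stand-ins for my actual datastore client (defined here as dummies only so the sketch is self-contained):

```python
class _FakeConn:
    """Dummy stand-in for a real datastore connection."""
    def __init__(self):
        self.pushed = []
    def close(self):
        self.closed = True

def create_connection():
    # placeholder: in reality this opens a connection to the datastore
    return _FakeConn()

def push_record(conn, row):
    # placeholder: in reality this writes one record to the datastore
    conn.pushed.append(row)

def push_partition(rows):
    """Open one connection per partition, push every record, then close."""
    conn = create_connection()
    try:
        for row in rows:
            push_record(conn, row)
    finally:
        conn.close()
    return iter([])  # mapPartitions expects an iterator back

# Intended Spark usage (needs a live SparkSession, so not runnable here):
# df.rdd.mapPartitions(push_partition).count()  # count() just forces evaluation
```

With thousands of partitions, this opens thousands of connections, which is what overloads the destination.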
Because of the number of partitions, the load on the destination datastore is very high. I can coalesce() to a smaller partition count based on the size of the DataFrame.
The amount of data produced by the SQL is not the same in all cases: sometimes it is a few hundred records, sometimes a few million. So I need a dynamic way to decide how many partitions to coalesce() to.
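One approach I am considering is a simple row-count heuristic (a sketch; `rows_per_partition` is a tuning knob I would pick based on what the datastore can handle, not anything from Spark):

```python
def num_partitions(total_rows, rows_per_partition=10_000):
    """Pick a partition count from the row count: ceil(total / per-partition)."""
    if total_rows <= 0:
        return 1
    return -(-total_rows // rows_per_partition)  # ceiling division

# Intended PySpark usage (not runnable here without a SparkSession):
# total = df.count()                       # costs one extra pass over the data
# df = df.coalesce(num_partitions(total))

print(num_partitions(250))        # → 1
print(num_partitions(3_500_000))  # → 350
```

The drawback is the extra df.count() pass, which is why I was looking at size estimation instead.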
After some googling I saw that SizeEstimator.estimate() can be used to estimate the size of a DataFrame, which could then be divided by a target size to arrive at a partition count. But looking at the implementation of SizeEstimator.estimate in Spark's repo, it appears to be written from a single-JVM standpoint and intended for local objects such as broadcast variables, not for RDDs/DataFrames that are distributed across JVMs.
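What I was hoping to do with the estimate, roughly (a sketch; the 128 MB target per partition is an arbitrary number I picked, not a Spark default I am relying on):

```python
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # arbitrary target size per partition

def partitions_from_size(estimated_bytes):
    """Divide the estimated DataFrame size by a target partition size (ceiling)."""
    return max(1, -(-estimated_bytes // TARGET_PARTITION_BYTES))

# Intended usage with the JVM-side SizeEstimator.estimate() — which, per the
# source, measures a local object in one JVM, not a distributed RDD/DataFrame,
# so this is exactly the step that seems wrong:
# n = partitions_from_size(SizeEstimator.estimate(df))
# df = df.coalesce(n)
```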
Can anyone suggest how to resolve this? And please correct me if my understanding is wrong.