I have a DataFrame created by a HiveContext executing a Hive SQL query; in my case the queried data has to be pushed to different datastores. The DataFrame ends up with thousands of partitions because of the SQL I am executing.
To push the data to the datastores I use mapPartitions(), obtaining a connection per partition and writing the records through it.
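For context, my write path looks roughly like this, sketched in Python/PySpark terms. `create_connection` and `push_record` are stand-ins for my actual datastore client (defined here as dummies only so the sketch is self-contained):

```python
class _FakeConn:
    """Dummy stand-in for a real datastore connection."""
    def __init__(self):
        self.pushed = []
    def close(self):
        self.closed = True

def create_connection():
    # placeholder: in reality this opens a connection to the datastore
    return _FakeConn()

def push_record(conn, row):
    # placeholder: in reality this writes one record to the datastore
    conn.pushed.append(row)

def push_partition(rows):
    """Open one connection per partition, push every record, then close."""
    conn = create_connection()
    try:
        for row in rows:
            push_record(conn, row)
    finally:
        conn.close()
    return iter([])  # mapPartitions expects an iterator back

# Intended Spark usage (needs a live SparkSession, so not runnable here):
# df.rdd.mapPartitions(push_partition).count()  # count() just forces evaluation
```

With thousands of partitions, this opens thousands of connections, which is what overloads the destination.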
Because of the number of partitions, the load on the destination datastore is very high. I can coalesce() to a smaller partition count based on the size of the DataFrame.
The amount of data produced by the SQL is not the same in all cases: sometimes it is a few hundred records, sometimes a few million. So I need a dynamic way to decide how many partitions to coalesce() to.
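One approach I am considering is a simple row-count heuristic (a sketch; `rows_per_partition` is a tuning knob I would pick based on what the datastore can handle, not anything from Spark):

```python
def num_partitions(total_rows, rows_per_partition=10_000):
    """Pick a partition count from the row count: ceil(total / per-partition)."""
    if total_rows <= 0:
        return 1
    return -(-total_rows // rows_per_partition)  # ceiling division

# Intended PySpark usage (not runnable here without a SparkSession):
# total = df.count()                       # costs one extra pass over the data
# df = df.coalesce(num_partitions(total))

print(num_partitions(250))        # → 1
print(num_partitions(3_500_000))  # → 350
```

The drawback is the extra df.count() pass, which is why I was looking at size estimation instead.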
After some googling I saw that SizeEstimator.estimate() can be used to estimate the size of a DataFrame, which could then be divided by a target size to arrive at a partition count. But looking at the implementation of SizeEstimator.estimate in Spark's repo, it appears to be written from a single-JVM standpoint and intended for local objects such as broadcast variables, not for RDDs/DataFrames that are distributed across JVMs.
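What I was hoping to do with the estimate, roughly (a sketch; the 128 MB target per partition is an arbitrary number I picked, not a Spark default I am relying on):

```python
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # arbitrary target size per partition

def partitions_from_size(estimated_bytes):
    """Divide the estimated DataFrame size by a target partition size (ceiling)."""
    return max(1, -(-estimated_bytes // TARGET_PARTITION_BYTES))

# Intended usage with the JVM-side SizeEstimator.estimate() — which, per the
# source, measures a local object in one JVM, not a distributed RDD/DataFrame,
# so this is exactly the step that seems wrong:
# n = partitions_from_size(SizeEstimator.estimate(df))
# df = df.coalesce(n)
```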
Can anyone suggest how to resolve this? And please correct me if my understanding is wrong.