In my application I have a Spark Dataset of X rows. I have several CSV files, each with a different size and structure, and I generate a Dataset over these CSVs.
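For context, this is roughly how the Dataset is created; the SparkSession variable, the input path and the schema handling are simplified placeholders here:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Simplified sketch - the real code resolves the path and structure per CSV file.
    SparkSession spark = SparkSession.builder().appName("csv-to-orc").getOrCreate();

    Dataset<Row> dataSet = spark.read()
            .option("header", "true")       // each CSV has a header row
            .option("inferSchema", "true")  // structure differs per file
            .csv(pathToCsv);                // pathToCsv is a placeholder for the real input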
Before posting this question I saw these questions:
- How can I find the size of a RDD
- how can you calculate the size of an apache spark data frame using pyspark?
- How to find spark RDD/Dataframe size?
- How to get a sample with an exact sample size in Spark RDD?
I need to calculate the size of each partition at runtime. The resulting files are ORC (Snappy compression).
All of the questions above suggest using SizeEstimator, so I also read about SizeEstimator.
When I tried to use SizeEstimator like this:

    SizeEstimator.estimate(dataFrame.rdd().partitions())

I got 71.124 MB. I also tried estimate on a sample with a partial read of the file, which gave the same size.
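For completeness, this is roughly the measurement I am running (variable names simplified); the 71.124 MB figure above is what this call returned:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.util.SizeEstimator;

    // Estimates the in-memory footprint of the Partition[] array of the RDD
    // backing the Dataset - this call is what returned ~71.124 MB.
    long estimatedBytes = SizeEstimator.estimate(dataSet.rdd().partitions());
    System.out.println("SizeEstimator.estimate: " + estimatedBytes + " bytes");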
This result just doesn't make sense to me. Here are some more details:
- Source file size: 44.8 KB (CSV), 300 rows.
- SizeEstimator.estimate(dataSet.rdd().partitions()): 71.124 MB
At runtime the resulting data frame is written to S3:

    dataSet.write().partitionBy(partitionColumn).option("header", "true").mode(SaveMode.Append).format("orc").option("compression", "snappy").save(pathTowrite);
- I would like to know the actual (uncompressed) size of each data frame partition before it is written.
- I would rather not read the files back from S3 after saving them: they are compressed, so that is not the real size, and it makes for poor resource planning (see the sketch after this list).
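To be explicit about why reading back from S3 does not help, this is roughly what that fallback would look like (using the Hadoop FileSystem API; pathTowrite as in the write call above). It only yields the compressed, on-disk size:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sums the on-disk size of everything written under pathTowrite.
    // This is the size after ORC + Snappy compression, not the uncompressed
    // data size I need for resource planning.
    Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
    FileSystem fs = FileSystem.get(URI.create(pathTowrite), hadoopConf);
    long compressedBytes = fs.getContentSummary(new Path(pathTowrite)).getLength();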
How come there is such a huge difference between the SizeEstimator result and the real size of the file? Does this make sense?
Is there another efficient way to estimate the data size of each partition (uncompressed) prior to saving it?
My entire code is in Java, so a Java solution is preferred.