In my application I have a Spark Dataset of X rows. I have several CSV files, each with a different size and structure, and I generate a Dataset over these CSVs.
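For context, this is roughly how the Dataset is created; the SparkSession variable, the input path and the schema handling are simplified placeholders here:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Simplified sketch - the real code resolves the path and structure per CSV file.
    SparkSession spark = SparkSession.builder().appName("csv-to-orc").getOrCreate();

    Dataset<Row> dataSet = spark.read()
            .option("header", "true")       // each CSV has a header row
            .option("inferSchema", "true")  // structure differs per file
            .csv(pathToCsv);                // pathToCsv is a placeholder for the real input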
Before posting this question I saw these questions:
- How can I find the size of a RDD
- how can you calculate the size of an apache spark data frame using pyspark?
- How to find spark RDD/Dataframe size?
- How to get a sample with an exact sample size in Spark RDD?
I need to calculate the size of each partition at runtime. The resulting files are ORC (Snappy compression).
All of the questions above suggest using SizeEstimator, so I also read about SizeEstimator.
When I tried to use SizeEstimator like this:

    SizeEstimator.estimate(dataFrame.rdd().partitions())

I got 71.124 MB. I also tried estimate on a sample with a partial read of the file, which gave the same size.
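For completeness, this is roughly the measurement I am running (variable names simplified); the 71.124 MB figure above is what this call returned:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.util.SizeEstimator;

    // Estimates the in-memory footprint of the Partition[] array of the RDD
    // backing the Dataset - this call is what returned ~71.124 MB.
    long estimatedBytes = SizeEstimator.estimate(dataSet.rdd().partitions());
    System.out.println("SizeEstimator.estimate: " + estimatedBytes + " bytes");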
This result just doesn't make sense to me. Here are some more details:
- Source file size: 44.8 KB (CSV), 300 rows.
- SizeEstimator.estimate(dataSet.rdd().partitions()): 71.124 MB
At runtime the resulting data frame is written to S3:

    dataSet.write().partitionBy(partitionColumn).option("header", "true").mode(SaveMode.Append).format("orc").option("compression", "snappy").save(pathTowrite);
- I would like to know the actual (uncompressed) size of each data frame partition before it is written.
- I would rather not read the files back from S3 after saving them: they are compressed, so that is not the real size, and it makes for poor resource planning (see the sketch after this list).
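To be explicit about why reading back from S3 does not help, this is roughly what that fallback would look like (using the Hadoop FileSystem API; pathTowrite as in the write call above). It only yields the compressed, on-disk size:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sums the on-disk size of everything written under pathTowrite.
    // This is the size after ORC + Snappy compression, not the uncompressed
    // data size I need for resource planning.
    Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
    FileSystem fs = FileSystem.get(URI.create(pathTowrite), hadoopConf);
    long compressedBytes = fs.getContentSummary(new Path(pathTowrite)).getLength();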
How come there is such a huge difference between the SizeEstimator result and the real size of the file? Does this make sense?
Is there another efficient way to estimate the data size of each partition (uncompressed) prior to saving it?
My entire code is in Java, so a Java solution is preferred.