
In my application I have a Spark Dataset of X rows. I have different CSV files, each with a different size and structure, and I'm generating a Dataset over these CSVs.

Before posting this question I looked at these questions:

  • I need to calculate the size of each partition during runtime
  • The result of the files are ORC(snappy compression)

All of the above questions suggest using SizeEstimator.

So I also read about SizeEstimator.

When I tried to use SizeEstimator on the partitions of my Dataset:

SizeEstimator.estimate(dataFrame.rdd().partitions())

I got this result: 71.124 MB. I also tried running the estimate on a sample with partial file reading, which gave the same size.
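
For reference, a minimal self-contained version of that measurement looks roughly like the sketch below (the session setup and the input path are placeholders, not my real code). Per its documentation, SizeEstimator.estimate() reports the number of bytes a given JVM object graph occupies in memory, and here that graph is the Partition[] metadata array of the underlying RDD:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.util.SizeEstimator;

    public class EstimateSketch {
        public static void main(String[] args) {
            // Placeholder session setup, not the application's real configuration
            SparkSession spark = SparkSession.builder()
                    .appName("size-estimate-sketch")
                    .master("local[*]")
                    .getOrCreate();

            // Placeholder path; the real input is the ~44.8 KB CSV described below
            Dataset<Row> dataSet = spark.read()
                    .option("header", "true")
                    .csv("/path/to/input.csv");

            // SizeEstimator walks the object graph it is given and returns its
            // in-memory size in bytes; here that graph is the Partition[] metadata
            // array of the underlying RDD plus everything those objects reference.
            long bytes = SizeEstimator.estimate(dataSet.rdd().partitions());
            System.out.println("SizeEstimator.estimate(partitions()): " + bytes + " bytes");

            spark.stop();
        }
    }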

This result just doesn't make sense to me. Here are some more details:

Source file size: 44.8 KB (CSV), 300 rows.

SizeEstimator.estimate(dataSet.rdd().partitions()): 71.124 MB

The data frame produced at runtime is stored to S3:

dataSet.write().partitionBy(partitionColumn).option("header", "true").mode(SaveMode.Append).format("orc").option("compression", "snappy").save(pathTowrite);
  • I would like to know the actual size of the data frame files without compression.
  • I would rather not read the files back from S3 after saving them:
  • they are compressed, so that is not the real size, and re-reading them is not the best resource planning.

    1. Why is there such a huge difference between the SizeEstimator result and the real size of the file? Does this make sense?

    2. Is there another efficient way to estimate the data size of each partition (uncompressed) before saving it?

My entire code is in Java, so a Java solution is preferred.


1 Answer


For now I am using a temporary solution, which is not efficient but quite close to what I need:

Spark DataSet efficiently get length size of entire row
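
In Java, that kind of approach boils down to something like the sketch below (not my exact code; the method name, the comma separator, and the rowLength/approxUncompressedBytes aliases are illustrative, and the character count is only a rough stand-in for uncompressed bytes): cast every column to string, measure the length of each row, and sum the lengths per partition value.

    import java.util.Arrays;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.*;

    // Approximate the uncompressed size of each output partition by rendering every
    // row as comma-separated text and summing the row lengths per partition value.
    // dataSet and partitionColumn are the same names used in the question above.
    public static Dataset<Row> approxUncompressedSizes(Dataset<Row> dataSet, String partitionColumn) {
        // Cast all columns to string so length() can be applied to the whole row
        Column[] asStrings = Arrays.stream(dataSet.columns())
                .map(c -> col(c).cast("string"))
                .toArray(Column[]::new);

        return dataSet
                .withColumn("rowLength", length(concat_ws(",", asStrings)))
                .groupBy(col(partitionColumn))
                .agg(sum("rowLength").alias("approxUncompressedBytes"));
    }

This costs a full extra pass over the data and counts characters rather than bytes (ignoring headers and newlines), which is why I consider it a temporary, not-efficient workaround.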
