
How do I estimate the size of a file before writing it to HDFS? I am using Apache Spark for this exercise: I read a file from HDFS, apply a filter, and write the result back to HDFS, but before writing I want to know the size of the output file.
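
A minimal sketch of the pipeline being described, with hypothetical paths, filter condition, and object name; the open question is how to estimate the output size at the point marked in the comments, before the write happens.

```scala
import org.apache.spark.sql.SparkSession

object FilterAndWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-and-write").getOrCreate()

    val df = spark.read.json("hdfs:///input/events.json")   // hypothetical input path
    val filtered = df.filter("status = 'active'")            // hypothetical filter

    // <-- want the size of the output file here, before writing it out

    filtered.write.json("hdfs:///output/events")             // hypothetical output path
    spark.stop()
  }
}
```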

    Which file format do you use? The same data written as csv would consume more space than a (compressed) parquet file – werner Apr 19 '18 at 17:33
  • I will be writing JSON in Snappy-compressed format. No, I do not want the size of the RDD or DataFrame; that is very different from the HDFS file size. – Bill Apr 19 '18 at 21:25
  • I don't think you can find the size of a *serialized RDD/DataFrame* before it's actually serialized and written to disk. In theory, you can `collect()` it locally (as you would need to do anyway), convert that output to a byte array, compress it with Snappy as you intend to, then find the size of that byte array – OneCricketeer Apr 20 '18 at 02:41
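
A sketch of the measurement suggested in the last comment: collect the filtered rows to the driver as JSON lines, Snappy-compress the bytes, and report the compressed length as a rough estimate of the eventual HDFS file size. This only works if the filtered data fits in driver memory, and the number is an approximation: Spark writes one compressed part-file per partition, and the Hadoop Snappy codec adds its own framing, so the on-disk size will differ slightly. The helper name is hypothetical; snappy-java ships with Spark, so the import should resolve.

```scala
import org.apache.spark.sql.DataFrame
import org.xerial.snappy.Snappy
import java.nio.charset.StandardCharsets

object CompressedSizeEstimate {
  /** Returns (uncompressedBytes, snappyCompressedBytes) for the DataFrame rendered as JSON lines. */
  def estimateJsonSnappySize(df: DataFrame): (Long, Long) = {
    val jsonBytes = df.toJSON.collect()          // bring the serialized rows to the driver
      .mkString("\n")
      .getBytes(StandardCharsets.UTF_8)
    val compressed = Snappy.compress(jsonBytes)  // raw Snappy block compression, not the Hadoop codec framing
    (jsonBytes.length.toLong, compressed.length.toLong)
  }
}
```

Used before the actual write, something like `val (raw, packed) = CompressedSizeEstimate.estimateJsonSnappySize(filtered)` followed by `filtered.write.option("compression", "snappy").json("hdfs:///output/events")`.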

0 Answers