
How do I estimate the size of a file before writing it to HDFS? I am using Apache Spark for this exercise: I read a file from HDFS, apply a filter, and write the result back to HDFS, but before writing I want to know the size of the output file.
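
A minimal sketch of the pipeline being described, with hypothetical paths, filter condition, and object name; the open question is how to estimate the output size at the point marked in the comments, before the write happens.

```scala
import org.apache.spark.sql.SparkSession

object FilterAndWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-and-write").getOrCreate()

    val df = spark.read.json("hdfs:///input/events.json")   // hypothetical input path
    val filtered = df.filter("status = 'active'")            // hypothetical filter

    // <-- want the size of the output file here, before writing it out

    filtered.write.json("hdfs:///output/events")             // hypothetical output path
    spark.stop()
  }
}
```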

    Which file format do you use? The same data written as csv would consume more space than a (compressed) parquet file – werner Apr 19 '18 at 17:33
  • I will be writing JSON in Snappy-compressed format. No, I do not want the size of the RDD or DataFrame; that is very different from the HDFS file size. – Bill Apr 19 '18 at 21:25
  • I don't think you can find the size of a *serialized RDD/DataFrame* before it's actually serialized and written to disk. In theory, you can `collect()` it locally (as you would need to do anyway), convert that output to a byte array, compress it with Snappy as you intend to, then find the size of that byte array – OneCricketeer Apr 20 '18 at 02:41
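
A sketch of the measurement suggested in the last comment: collect the filtered rows to the driver as JSON lines, Snappy-compress the bytes, and report the compressed length as a rough estimate of the eventual HDFS file size. This only works if the filtered data fits in driver memory, and the number is an approximation: Spark writes one compressed part-file per partition, and the Hadoop Snappy codec adds its own framing, so the on-disk size will differ slightly. The helper name is hypothetical; snappy-java ships with Spark, so the import should resolve.

```scala
import org.apache.spark.sql.DataFrame
import org.xerial.snappy.Snappy
import java.nio.charset.StandardCharsets

object CompressedSizeEstimate {
  /** Returns (uncompressedBytes, snappyCompressedBytes) for the DataFrame rendered as JSON lines. */
  def estimateJsonSnappySize(df: DataFrame): (Long, Long) = {
    val jsonBytes = df.toJSON.collect()          // bring the serialized rows to the driver
      .mkString("\n")
      .getBytes(StandardCharsets.UTF_8)
    val compressed = Snappy.compress(jsonBytes)  // raw Snappy block compression, not the Hadoop codec framing
    (jsonBytes.length.toLong, compressed.length.toLong)
  }
}
```

Used before the actual write, something like `val (raw, packed) = CompressedSizeEstimate.estimateJsonSnappySize(filtered)` followed by `filtered.write.option("compression", "snappy").json("hdfs:///output/events")`.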

0 Answers