How do I estimate the size of a file before writing it to HDFS? I am using Apache Spark for this exercise: I read a file from HDFS, apply a filter, and then write the result back to HDFS, but before writing I want to know the size of the output file.
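One pragmatic workaround is to write the filtered result to a temporary HDFS directory first, measure it with the Hadoop FileSystem API, and only then move or rewrite it to its final location. Below is a minimal sketch of that idea; the input path, the column used in the filter, and the temporary directory are all hypothetical and would need to match your data:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("size-estimate").getOrCreate()

// Hypothetical input path and filter; adjust to your data.
val filtered = spark.read.json("hdfs:///data/input.json").filter("value > 0")

// Write to a temporary directory first, then measure the bytes on disk.
val tmpPath = "hdfs:///tmp/size_check"
filtered.write.mode("overwrite").json(tmpPath)

// Sum the size of all part files under the temporary directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val bytes = fs.getContentSummary(new Path(tmpPath)).getLength
println(s"Output size on HDFS: $bytes bytes")
```

This measures the real on-disk size (including whatever compression the writer applies), at the cost of performing the write once before the final placement.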
- Which file format do you use? The same data written as CSV would consume more space than a (compressed) Parquet file. – werner Apr 19 '18 at 17:33
- I will be writing JSON in Snappy-compressed format. And no, I do not want the size of the RDD or DataFrame; that is very different from the HDFS file size. – Bill Apr 19 '18 at 21:25
- I don't think you can find the size of a *serialized RDD/DataFrame* before it is actually serialized and written to disk. In theory, you can `collect()` it locally (as you would need to do anyway), serialize that output to a byte array, compress it with Snappy as you intend, and then measure the size of that byte array. – OneCricketeer Apr 20 '18 at 02:41
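A rough sketch of that suggestion, assuming the filtered result is small enough to collect to the driver and using the snappy-java library that ships with Spark (the input path and filter column are hypothetical):

```scala
import org.xerial.snappy.Snappy

// Hypothetical input and filter; adjust to your data.
val filtered = spark.read.json("hdfs:///data/input.json").filter("value > 0")

// Collect the rows to the driver as JSON lines -- only viable for small results.
val jsonBytes = filtered.toJSON.collect().mkString("\n").getBytes("UTF-8")

// Compress with snappy-java and measure the compressed length.
val compressed = Snappy.compress(jsonBytes)
println(s"Estimated compressed size: ${compressed.length} bytes")
```

Note that Spark's actual Snappy-compressed output goes through the Hadoop codec's framed block format and is split across multiple part files, so a raw snappy-java compression of the collected bytes only approximates the final on-disk size.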