As discussed in several other questions (here and here), the `hadoop fs -du -s -h` command (or equivalently `hdfs dfs -du -s -h`) shows two values:
- The raw file size
- The size on disk, taking replication into account
e.g.
19.9 M 59.6 M /path/folder/test.avro
So on our cluster, which has a replication factor of 3, we'd normally expect the second number to be 3x the first, and in the example above it is (19.9 M × 3 ≈ 59.7 M).
But when checking up on a running Spark job recently, the first number was 246.9 K and the second was 3.4 G, approximately 14,000 times larger!
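As a rough sanity check of that ratio (using just the figures quoted above, with 3.4 G converted to KiB; Hadoop's `-h` output uses 1024-based units):

```sh
# 3.4 G in KiB divided by the 246.9 K raw size; prints roughly 14440
echo "scale=1; (3.4 * 1024 * 1024) / 246.9" | bc
```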
Does this indicate a problem? Why isn't the replicated size 3x the raw size?
Is this perhaps because one of the values takes the HDFS block size into account while the other doesn't?
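For what it's worth, here is a sketch of how the files could be inspected to check this (the path is just the example one from above; substitute whatever directory the Spark job is actually writing, and the commands assume a reasonably recent Hadoop):

```sh
# Per-file stats: %b = size in bytes, %r = replication factor, %o = HDFS block size
hdfs dfs -stat "bytes=%b repl=%r blocksize=%o name=%n" /path/folder/test.avro

# fsck walks the path and reports each file's blocks, their sizes, and replica counts
hdfs fsck /path/folder -files -blocks
```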
The Hadoop documentation on this command isn't terribly helpful, stating only:
> The du returns three columns with the following format:
>
> size disk_space_consumed_with_all_replicas full_path_name