
As discussed in several other questions (here and here), the hadoop fs -du -s -h command (or equivalently hdfs dfs -du -s -h) shows two values:

  1. The pure file size
  2. The file size taking into account replication

e.g.

19.9 M  59.6 M  /path/folder/test.avro

So normally we'd expect the second number to be 3x the first, since our cluster has a replication factor of 3.

But when checking up on a running Spark job recently, the first number was 246.9 K and the second was 3.4 G, approximately 14,000 times larger!

Does this indicate a problem? Why isn't the replicated size 3x the raw size?

Is this perhaps because one of the values takes block size into account and the other doesn't?
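
(One way to dig into this, I assume, would be to list the per-file replication factor and block size under the directory in question. A rough sketch using the example path from above; note that the %r and %o format specifiers of hdfs dfs -stat may require a reasonably recent Hadoop release:)

# Per-file size in bytes (%b), replication factor (%r) and block size (%o)
hdfs dfs -stat "%n  size=%b  repl=%r  blocksize=%o" /path/folder/*

# Block-level report for the same directory: blocks per file and average replication
hdfs fsck /path/folder -files -blocks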

The Hadoop documentation on this command isn't terribly helpful, stating only:

The du returns three columns with the following format

size disk_space_consumed_with_all_replicas full_path_name

  • 248K or 248MB? 3.4GB is 14000x 248K. Also, you can change replication factor for your spark app (see https://stackoverflow.com/questions/46098118/how-can-i-change-hdfs-replication-factor-for-my-spark-program). – tk421 Jul 01 '19 at 20:48
  • Thanks, now fixed. Just to confirm that the replication factor is currently 3. – DNA Jul 02 '19 at 14:45
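
(Regarding the comment above about changing the replication factor for a Spark job: as I understand it, Spark passes any spark.hadoop.* property through to the underlying Hadoop configuration, so something along these lines should control the replication of a job's output files. The class and jar names are placeholders and the value 3 is just illustrative:)

# Pass dfs.replication through to the HDFS client for this job only
spark-submit \
  --conf spark.hadoop.dfs.replication=3 \
  --class com.example.MyJob \
  my-job.jar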

0 Answers