
As discussed in several other questions (here and here), the hadoop fs -du -s -h command (or equivalently hdfs dfs -du -s -h) shows two values:

  1. The pure file size
  2. The file size taking into account replication

e.g.

19.9 M  59.6 M  /path/folder/test.avro

So normally we'd expect the second number to be 3x the first, since our cluster has a replication factor of 3.

But when checking up on a running Spark job recently, the first number was 246.9 K and the second was 3.4 G, approximately 14,000 times larger!

Does this indicate a problem? Why isn't the replicated size 3x the raw size?

Is this perhaps because one of the values takes block size into account and the other doesn't?
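
(One way to dig into this, I assume, would be to list the per-file replication factor and block size under the directory in question. A rough sketch using the example path from above; note that the %r and %o format specifiers of hdfs dfs -stat may require a reasonably recent Hadoop release:)

# Per-file size in bytes (%b), replication factor (%r) and block size (%o)
hdfs dfs -stat "%n  size=%b  repl=%r  blocksize=%o" /path/folder/*

# Block-level report for the same directory: blocks per file and average replication
hdfs fsck /path/folder -files -blocks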

The Hadoop documentation on this command isn't terribly helpful, stating only:

The du returns three columns with the following format

size disk_space_consumed_with_all_replicas full_path_name

  • 248K or 248MB? 3.4GB is 14000x 248K. Also, you can change replication factor for your spark app (see https://stackoverflow.com/questions/46098118/how-can-i-change-hdfs-replication-factor-for-my-spark-program). – tk421 Jul 01 '19 at 20:48
  • Thanks, now fixed. Just to confirm that the replication factor is currently 3. – DNA Jul 02 '19 at 14:45
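
(Regarding the comment above about changing the replication factor for a Spark job: as I understand it, Spark passes any spark.hadoop.* property through to the underlying Hadoop configuration, so something along these lines should control the replication of a job's output files. The class and jar names are placeholders and the value 3 is just illustrative:)

# Pass dfs.replication through to the HDFS client for this job only
spark-submit \
  --conf spark.hadoop.dfs.replication=3 \
  --class com.example.MyJob \
  my-job.jar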

0 Answers