I have a cluster up and running (HDP-2.3.0.0-2557); it consists of 10 physical servers (2 management servers and 8 data nodes, all of which are healthy). The cluster (HDFS) was loaded with an initial dataset of roughly 4 TB of data over a month ago. Most importantly, after loading there were no reports of any missing or corrupt blocks!
I loaded up the Ambari dashboard after a month of not using the system at all, and under the HDFS summary - Block Error section I am seeing "28 missing / 28 under replicated". The servers have not been used at all in that time: no MapReduce jobs, and no new files read from or written to HDFS. How is it possible that 28 blocks are now reported as missing or corrupt?
The original data source, which resides on a single 4 TB disk, has no missing blocks, no corrupt files, or anything of the sort, and is working just fine! Storing the data in triplicate in HDFS should surely safeguard me against files being lost or corrupted.
I have run all the suggested fsck commands and can see lines such as:
/user/ambari-qa/examples/input-data/rawLogs/2010/01/01/01/40/log05.txt: MISSING 1 blocks of total size 15 B...........
/user/ambari-qa/examples/src/org/apache/oozie/example/DemoMapper.java: CORRUPT blockpool BP-277908767-10.13.70.142-1443449015470 block blk_1073742397
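For reference, these are the sorts of commands I ran to get those lines (a sketch rather than an exact transcript; the /user/ambari-qa path is just one of the affected directories on my cluster):
hdfs fsck / -list-corruptfileblocks
# lists every file the NameNode currently considers corrupt, with the affected block IDs
hdfs fsck /user/ambari-qa/examples -files -blocks -locations
# shows, per file, each block ID and which DataNodes (if any) still hold a replica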
I convinced my manager that Hadoop was the way forward because of its impressive resilience claims, but this example seems to prove (to me at least) that HDFS is flawed. Perhaps I'm doing something wrong, but surely I should not have to go searching around a file system for missing blocks. I need to get back to my manager with an explanation; if one of these 28 missing blocks had belonged to a critical file, HDFS would have landed me in hot water! At this point my manager thinks HDFS is not fit for purpose!
I must be missing something or doing something wrong; surely blocks stored in triplicate are three times less likely to go missing?! The concept is that if one data node is taken offline, its blocks are marked as under-replicated and eventually re-replicated to another data node.
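For what it's worth, this is how I convinced myself that triplicate replication really was in effect before the cluster went idle (a sketch; the log05.txt path is just one of the files later reported missing, taken from the fsck output above):
hdfs getconf -confKey dfs.replication
# prints the configured default replication factor; 3 on a stock HDP install
hdfs dfs -stat "%r %n" /user/ambari-qa/examples/input-data/rawLogs/2010/01/01/01/40/log05.txt
# %r is the replication factor recorded in the NameNode metadata for that file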
In summary: a default HDP install was performed with all services started. 4 TB of data was copied to HDFS with no reported errors (all blocks are stored with the default triplicate replication). Everything was left standing for a month. The HDFS summary is now reporting 28 missing blocks (with no disk errors encountered on any of the 8 data nodes).
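And to rule out hardware, this is roughly how I checked for disk problems on the data nodes (a sketch; the DataNode log path is the usual HDP default and the grep patterns are just heuristics, so treat both as assumptions on my part):
hdfs dfsadmin -report
# the NameNode's view of the cluster: live/dead DataNode counts and per-node capacity
# then on each DataNode:
grep -iE "failed volume|i/o error" /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | tail -n 50
dmesg | grep -iE "i/o error"
# looking for anything that points at a failing disk or a failed data directory volume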
Has anyone else had a similar experience?
Last section of output from the "hdfs fsck /" command:
Total size: 462105508821 B (Total open files size: 1143 B)
Total dirs: 4389
Total files: 39951
Total symlinks: 0 (Files currently being written: 13)
Total blocks (validated): 41889 (avg. block size 11031667 B) (Total open file blocks (not validated): 12)
********************************
UNDER MIN REPL'D BLOCKS: 40 (0.09549046 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 40
MISSING BLOCKS: 40
MISSING SIZE: 156470223 B
CORRUPT BLOCKS: 28
********************************
Minimally replicated blocks: 41861 (99.93316 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.998138
Corrupt blocks: 28
Missing replicas: 0 (0.0 %)
Number of data-nodes: 8
Number of racks: 1
FSCK ended at Thu Dec 24 03:18:32 CST 2015 in 979 milliseconds
The filesystem under path '/' is CORRUPT
Thanks for reading!