I am seeing lots of errors like the following when running a Spark job on Google Dataproc, whenever the program tries to access a file on the cluster's HDFS (yes, I work in bioinformatics, if that matters):
```
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
BP-605637428-10.128.0.34-1564425505397:blk_1073741871_1047
file=/reference/Homo_sapiens_assembly38.fasta
```
When I parsed the log, the exceptions repeatedly complain about the same 4–5 blocks. The file is ~3GB and the HDFS block size is set to ~138MB.
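For reference, this is roughly how I counted which block IDs the log keeps complaining about (the log path is just a placeholder for wherever the driver/executor logs end up):

```
# Count how often each block ID appears in the collected Spark logs
# (spark-job.log is a placeholder path)
grep -oE 'blk_[0-9]+_[0-9]+' spark-job.log | sort | uniq -c | sort -rn
```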
Then I ran `hadoop fsck /reference/Homo_sapiens_assembly38.fasta` and got the following:
```
.Status: HEALTHY
 Total size:    3249912778 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      25 (avg. block size 129996511 B)
 Minimally replicated blocks:   25 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          8
 Number of racks:               1
```
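Since the summary reports HEALTHY, my next step will probably be to re-run fsck with the per-block options so I can see exactly which datanodes hold the replicas of the blocks named in the exception, something like:

```
# List every block of the file, its replicas, and the datanodes/racks holding them
hadoop fsck /reference/Homo_sapiens_assembly38.fasta -files -blocks -locations
```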
I also tried setting `dfs.client.use.datanode.hostname` to `true` when creating the Dataproc cluster, as indicated here and here, but no success either.
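For completeness, this is roughly the cluster-creation command I used; the cluster name and region are placeholders, and only the `--properties` flag matters here:

```
# Inject dfs.client.use.datanode.hostname into hdfs-site.xml at cluster creation
# (my-cluster and us-central1 are placeholders)
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='hdfs:dfs.client.use.datanode.hostname=true'
```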
I am also planning to increase `dfs.replication` from the Dataproc default of 2 to 3, but Google says this, so I am not sure yet whether that will impact performance.
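If I do try that, my understanding is I can either set it as a cluster property at creation time or bump the replication of the file that is already in HDFS, roughly like this (same placeholder cluster name as above):

```
# Option 1: raise the default replication for new files at cluster creation
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='hdfs:dfs.replication=3'

# Option 2: raise replication of the existing file only; -w waits for completion
hdfs dfs -setrep -w 3 /reference/Homo_sapiens_assembly38.fasta
```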
Does anyone have an idea what is happening?