
I am seeing lots of errors like the following when running a Spark job on Google Dataproc, where the program tries to access a file on the cluster's HDFS (yes, I am in bioinformatics, if that matters):

Caused by: 
org.apache.hadoop.hdfs.BlockMissingException: 
Could not obtain block: 
BP-605637428-10.128.0.34-1564425505397:blk_1073741871_1047 
file=/reference/Homo_sapiens_assembly38.fasta

When I parsed the log, I found that the exceptions repeatedly complain about the same 4~5 blocks. The file is ~3GB and the block size on the HDFS is set to ~138MB.
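
To see where those blocks live, one option is hdfs fsck -files -blocks -locations, which prints the block IDs together with their datanodes; the byte-range-to-datanode mapping can also be pulled from a small client program, roughly along these lines (Scala, Hadoop FileSystem API; this is just a sketch, assuming it runs on the cluster where HDFS is the default filesystem):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListBlockLocations {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()            // picks up the cluster's core-site.xml / hdfs-site.xml
        val fs   = FileSystem.get(conf)           // assumes fs.defaultFS points at the cluster's HDFS
        val path = new Path("/reference/Homo_sapiens_assembly38.fasta")

        val status = fs.getFileStatus(path)
        val blocks = fs.getFileBlockLocations(status, 0, status.getLen)

        // One line per block: its byte range and the datanodes that hold a replica of it.
        blocks.zipWithIndex.foreach { case (b, i) =>
          val hosts = b.getHosts.mkString(", ")
          println(s"block $i: offset=${b.getOffset} length=${b.getLength} hosts=$hosts")
        }
        fs.close()
      }
    }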

Then I ran hadoop fsck /reference/Homo_sapiens_assembly38.fasta and got the following:

.Status: HEALTHY
 Total size:    3249912778 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):  25 (avg. block size 129996511 B)
 Minimally replicated blocks:   25 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    2
 Average block replication: 2.0
 Corrupt blocks:        0
 Missing replicas:      0 (0.0 %)
 Number of data-nodes:      8
 Number of racks:       1
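
Since fsck mostly reflects what the NameNode believes about the blocks, I understand a HEALTHY status does not necessarily mean every DataNode will actually serve its replicas to a client. So another check I am considering is to stream the whole file through a plain HDFS client on the master node, outside of Spark, and see whether the same BlockMissingException shows up. Roughly like this (same assumptions as the sketch above):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ReadThroughCheck {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new Configuration())   // assumes HDFS is the default filesystem
        val in = fs.open(new Path("/reference/Homo_sapiens_assembly38.fasta"))

        // Stream the entire file and discard the bytes; an unreadable block should
        // surface here as a BlockMissingException, just like in the Spark job.
        val buf   = new Array[Byte](8 * 1024 * 1024)
        var total = 0L
        var n     = in.read(buf)
        while (n != -1) {
          total += n
          n = in.read(buf)
        }
        in.close()
        fs.close()
        println(s"read $total bytes successfully")
      }
    }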

I also tried setting dfs.client.use.datanode.hostname to true when creating the Dataproc cluster, as indicated here and here, but had no success either.
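
For what it's worth, I believe the same property can also be set per Spark application at runtime rather than only at cluster creation, roughly like this (a sketch, not a confirmed fix):

    import org.apache.spark.sql.SparkSession

    // Sketch: make the HDFS client used by this Spark application address
    // datanodes by hostname rather than by their (internal) IP addresses.
    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")

I believe the equivalent can also be passed to spark-submit as --conf spark.hadoop.dfs.client.use.datanode.hostname=true.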

I am also planning to increase dfs.replication from the Dataproc default of 2 to 3, but Google says this, so I am not sure yet whether it will impact performance.
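
If I understand correctly, dfs.replication is only the default applied to newly written files, so raising it at cluster creation would not change this file, which is already in HDFS; I would also have to bump the replication of the existing file explicitly, e.g. with hadoop fs -setrep 3 /reference/Homo_sapiens_assembly38.fasta, or programmatically along these lines:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: raise the replication factor of the already-existing reference file to 3.
    // Equivalent to: hadoop fs -setrep 3 /reference/Homo_sapiens_assembly38.fasta
    val fs = FileSystem.get(new Configuration())
    fs.setReplication(new Path("/reference/Homo_sapiens_assembly38.fasta"), 3.toShort)
    fs.close()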

Does anyone have an idea of what is happening?

CloudyTrees
  • Check the health of your nodes on the resource manager; it appears to me that some of your data nodes are turning unhealthy and eventually going down or becoming unresponsive, which is why your job is failing with the BlockMissingException. – VenkateswaraCh Jul 31 '19 at 17:38
  • @allbutlinear Thanks! But do you know how to do that for a Dataproc cluster (a link to some doc would be perfect)? – CloudyTrees Jul 31 '19 at 17:41
  • Please find the [Link](https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces) and look for YARN ResourceManager. – VenkateswaraCh Jul 31 '19 at 18:06
  • Can you instead put that file on GCS? – Karthik Palaniappan Aug 04 '19 at 19:55
  • @KarthikPalaniappan, I actually tried that, and it turns out that the code I rely on reads from GCS in a very inefficient manner. So I need to understand why this is happening. Thanks for the input though! – CloudyTrees Aug 05 '19 at 14:11
  • You can also look through your datanode logs (with stackdriver/pantheon/cloud console) to identify more details on the problem that is causing unavailability of blocks. – Aniket Mokashi Aug 07 '19 at 17:32
  • What were the settings used to create the cluster? Are the disks attached to the nodes extremely small? Even if HDFS isn't the thing responsible for using up disk space, if you filled up the disks with logs or temporary data, then HDFS would run out of space too and datanodes would become unhealthy. – Dennis Huo Aug 21 '19 at 01:10
  • @CloudyTrees any update on this? – jasper Jun 29 '20 at 17:11
  • @jasper sorry, I've put this personal project on hold. I'll get back to this once the current project with an urgent deadline is done. – CloudyTrees Jun 29 '20 at 20:05
  • @CloudyTrees I’m having this issue at work, when I fix it I’ll get back to you – jasper Jun 29 '20 at 21:07

1 Answer


I had this problem as well; in my case, the input file was corrupted. I just uploaded the file to HDFS again and it worked fine.
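
In case it helps, re-uploading can be done with hdfs dfs -put -f from the master node, or programmatically along these lines (the local path below is just a placeholder for wherever your copy of the file lives):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: overwrite the (possibly corrupted) copy in HDFS with a fresh upload.
    val fs = FileSystem.get(new Configuration())
    fs.copyFromLocalFile(
      false,                                                     // delSrc: keep the local copy
      true,                                                      // overwrite the existing HDFS file
      new Path("file:///path/to/Homo_sapiens_assembly38.fasta"), // placeholder local path
      new Path("/reference/Homo_sapiens_assembly38.fasta")
    )
    fs.close()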

lgonzales