java.io.IOException: Not a data file

Question

I am processing a set of Avro files stored in HDFS in a nested year/month/day/hour directory structure.

I wrote this simple code to process them:

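import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val rootDir = "/user/cloudera/rootDir"
val rdd1 = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](rootDir)
rdd1.count()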

I get the exception pasted below. The biggest problem is that it doesn't tell me which file is not a data file, so I would have to go into HDFS and scan through thousands of files to find it.

Is there a more efficient way to debug or solve this?

5/11/01 19:01:49 WARN TaskSetManager: Lost task 1084.0 in stage 14.0 (TID 11562, datanode): java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
    at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:183)
    at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:94)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

SparkleGoat · Accepted Answer · 2016-05-31T14:03:25.447

One of the nodes in your cluster, the one where the block is located, is down. Because of that the data cannot be found, which produces the error. The solution is to repair and bring up all the nodes in the cluster.

I was getting the exact error below with my Java MapReduce program that uses Avro input. Here is a rundown of the issue.

Error: java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
    at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:183)
    at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:94)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Because the program ran fine over another file in the same HDFS folder, I decided to cat the failing file and received the following:

INFO hdfs.DFSClient: No node available for <Block location in your
cluster> from any node: java.io.IOException: No live nodes contain
 block BP-6168826450-10.1.10.123-1457116155679:blk_1073853378_112574
 after checking nodes = [], ignoredNodes = null No live nodes contain
 current block Block locations: Dead nodes: . Will get new block
 locations from namenode and retry...

We had been having some problems with our cluster, and unfortunately some nodes were down. Once those nodes were repaired, the error went away.

This is an answer, the solution is in bold. One of the nodes are down where the block is located. The data is not found because of that, which gives the error. The solution is to repair and bring up all the nodes in the cluster. — SparkleGoat, May 27 '16 at 01:30
You can probably incorporate the message in your comment into your answer. — pinkpanther, May 29 '16 at 21:51

score 1 · Answer 2 · answered Jun 13 '17 at 16:01

I was getting the same error while reading Avro files in my MapReduce job. After investigating a little, I found that the Avro files on which the map tasks fail are all zero-byte files. It looks like the Avro reader used by the job cannot handle zero-byte files, since they have no container header to read.
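
The zero-byte observation above suggests a cheap pre-check: an Avro object container file begins with the four magic bytes 'O', 'b', 'j', 1, so a file that is empty or starts with anything else will be rejected by the reader. The following is a minimal sketch of such a check, not a definitive implementation. The names AvroHeaderCheck, hasAvroMagic and isAvroFile are hypothetical, and the wrapper reads from a local path only to keep the example self-contained.

```scala
import java.io.{DataInputStream, EOFException, InputStream}
import java.nio.file.{Files, Path}

// Hypothetical helper, named for illustration only.
object AvroHeaderCheck {
  // Avro object container files start with the bytes 'O', 'b', 'j', 1.
  private val Magic: Array[Byte] =
    Array('O'.toByte, 'b'.toByte, 'j'.toByte, 1.toByte)

  /** Returns true if the stream begins with the Avro container magic bytes. */
  def hasAvroMagic(in: InputStream): Boolean = {
    val header = new Array[Byte](Magic.length)
    try {
      new DataInputStream(in).readFully(header)
      java.util.Arrays.equals(header, Magic)
    } catch {
      // Fewer than four bytes available, for example a zero-byte file.
      case _: EOFException => false
    }
  }

  /** Convenience wrapper for a local path; closes the stream when done. */
  def isAvroFile(path: Path): Boolean = {
    val in = Files.newInputStream(path)
    try hasAvroMagic(in) finally in.close()
  }
}
```

Because hasAvroMagic only needs an InputStream, it could presumably also be given the stream returned by Hadoop's FileSystem.open for each file found by a recursive FileSystem.listFiles call, printing the paths that fail the check. That wiring is an assumption about how one might use it on HDFS and is not shown here.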

Heapify


score 0 · Answer 3 · answered Mar 16 '19 at 00:13

In my case, I was trying to read the data using DataFileReader, which expects the data to be in a certain format (written using DataFileWriter), but my data file was handcrafted, so I was getting this error.
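
For comparison, here is a minimal sketch of the format DataFileReader does accept: a container file produced by DataFileWriter and then read back. It assumes the Avro library is on the classpath. The object name ContainerRoundTrip and the one-field schema are hypothetical and exist only for this illustration.

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

// Hypothetical example object, named for illustration only.
object ContainerRoundTrip {
  // Hypothetical schema with a single long field.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Example","fields":[{"name":"id","type":"long"}]}""")

  /** Writes one record into an Avro container file. */
  def write(file: File): Unit = {
    val writer = new DataFileWriter[GenericRecord](
      new GenericDatumWriter[GenericRecord](schema))
    try {
      // create() writes the container header that DataFileReader looks for.
      writer.create(schema, file)
      val record = new GenericData.Record(schema)
      record.put("id", java.lang.Long.valueOf(1L))
      writer.append(record)
    } finally writer.close()
  }

  /** Reads the "id" field of every record in the container file. */
  def readIds(file: File): List[Long] = {
    val reader = new DataFileReader[GenericRecord](
      file, new GenericDatumReader[GenericRecord](schema))
    try {
      val ids = scala.collection.mutable.ListBuffer.empty[Long]
      while (reader.hasNext) ids += reader.next().get("id").asInstanceOf[Long]
      ids.toList
    } finally reader.close()
  }
}
```

A handcrafted file that holds only a JSON-encoded record has no such header, which is consistent with DataFileReader rejecting it.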

I got around this problem by using a JsonDecoder: the jsonDecoder method of DecoderFactory takes the schema and the JSON-encoded Avro record as parameters and returns a decoder. This decoder can then be used with GenericDatumReader to read your GenericRecord. Here's the Scala code for your reference.

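    import scala.io.Source
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.{Decoder, DecoderFactory}

    // `schema` is the org.apache.avro.Schema of the record, defined elsewhere.
    val avroJson = Source.fromURL(getClass.getResource("/record.avro")).mkString
    val decoder: Decoder = DecoderFactory.get().jsonDecoder(schema, avroJson)

    val datumReader = new GenericDatumReader[GenericRecord](schema)
    val avroRecord: GenericRecord = datumReader.read(null, decoder)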

HTH.
