I want to store some information about the files being processed from HDFS. What would be the most suitable way to read the file location and byte offset of a file stored in HDFS from a Java program?

Is there a concept of a unique file ID associated with each file stored in Hadoop 1? If so, how can it be fetched in a MapReduce program?

Anish Gupta

2 Answers


As per my understanding, the org.apache.hadoop.fs.FileSystem class covers all your needs:

1. You can identify each file uniquely by its URI, or you can use getFileChecksum(Path path).
2. You can get the locations of all of a file's blocks with getFileBlockLocations(FileStatus file, long start, long len).

Note that TextInputFormat gives the key the byte offset of each line's starting position within the file, which is not the same as a physical offset on HDFS.

There are many other methods available in FileSystem; please go through it for better understanding.
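A minimal sketch of those two calls, assuming a hypothetical input path /user/hadoop/input/sample.txt (getFileChecksum and getFileBlockLocations are real FileSystem methods; everything else here is just illustration):

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFileInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical input path; replace with the file you are inspecting.
            Path path = new Path("/user/hadoop/input/sample.txt");
            FileStatus status = fs.getFileStatus(path);

            // The fully qualified URI uniquely identifies the file in the cluster.
            System.out.println("URI: " + path.makeQualified(fs));

            // MD5-of-CRC checksum of the file contents.
            FileChecksum checksum = fs.getFileChecksum(path);
            System.out.println("Checksum: " + checksum);

            // Offset, length, and hosts of every block of the file.
            for (BlockLocation block :
                    fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }

            fs.close();
        }
    }

FileSystem.get(conf) picks up fs.default.name from your cluster configuration, so the same code runs against HDFS or the local file system.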
Hope it helps.

Mr.Chowdary

According to "Hadoop: The Definitive Guide", TextInputFormat gives the key the byte offset of the start of each line within the file.

For the filename, you can look into these questions (a sketch putting the two together follows the links):

Mapper input Key-Value pair in Hadoop

How to get the filename from a streaming mapreduce job in R?
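A minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API with TextInputFormat, where the input split can be cast to FileSplit to recover the file path; the class name FileAwareMapper and the emitted output format are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Illustrative mapper: emits (file name, offset + line) for each record.
    public class FileAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // With TextInputFormat the key is the byte offset of the start
            // of this line within the input file.
            long byteOffset = key.get();

            // The input split identifies the file this record came from.
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().toString();

            context.write(new Text(fileName),
                    new Text("offset=" + byteOffset + " line=" + value));
        }
    }

With the old org.apache.hadoop.mapred API (or in a streaming job), the file name is exposed through the map.input.file configuration property instead.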

user1676389