
Big Data, Hadoop 1st generation. I am very new to Apache Hadoop, and I have a doubt; maybe my question is irrelevant.

Problem: the word count problem (dry run by hand).

Example :

File Name : test.txt

File Size : 120 MB

Default Block size : 64 MB

File Content :

Hello StackOverflow
Hi StackOverflow
Hola StackOverflow
Mushi Mushi StackOverflow
.....
.....
.....
Mushi Mushi StackOverflow

Number of blocks will be : 2 (64 MB + 56 MB)

Block 1 contains :

Hello StackOverflow
Hi StackOverflow
Hola StackOverflow
Mushi Mus

Block 2 contains :

hi StackOverflow
.....
.....
.....
Mushi Mushi StackOverflow

NOTE: Here the word "Mushi" is split between block 1 and block 2, because the block reached 64 MB at "Mus"; the remaining "hi" went into block 2.

Now my questions are: Q1) Is this scenario possible?

Q2) If not, why?

Q3) If yes, then what will the word count output be?

Q4) What will the Mapper's output be for each block?


1 Answer


The MapReduce framework works on InputSplits rather than HDFS blocks.

Have a look at the SE posts below for a better understanding of InputSplits and the number of mappers for a given file.

How does Hadoop process records split across block boundaries?

Default number of reducers
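
As a rough illustration of why the file in the question gets two mappers: with TextInputFormat and default settings, FileInputFormat sizes each split as max(minSplitSize, min(maxSplitSize, blockSize)), which collapses to the block size. The following is a plain-Java sketch of that arithmetic (not Hadoop code, and it ignores the small slop allowance Hadoop applies to the last split):

public class SplitCountSketch {
    // Sketch of FileInputFormat's split sizing: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long fileSize  = 120L * 1024 * 1024;  // 120 MB file from the question
        long blockSize =  64L * 1024 * 1024;  // default 64 MB block in Hadoop 1.x
        long split = splitSize(blockSize, 1L, Long.MAX_VALUE);   // defaults -> 64 MB
        long numSplits = (fileSize + split - 1) / split;         // ceiling division
        System.out.println(numSplits + " splits -> " + numSplits + " mappers");  // prints "2 splits -> 2 mappers"
    }
}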

Regarding your questions:

Q1) Is this scenario possible?

Yes, it is possible.

Q3) If yes, then what will the word count output be?

The data in Block 2 will be copied to the Mapper node that is processing the InputSplit, so the record that was split across the blocks is processed as a whole.
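
Because the mapper for the first split is handed complete lines, it emits (word, 1) pairs for whole words only, e.g. ("Mushi", 1), ("Mushi", 1), ("StackOverflow", 1) for the last line of Block 1; the mapper for the second split skips the leading partial line and starts at the next full line. For reference, a minimal version of the usual word-count mapper (new org.apache.hadoop.mapreduce API; the class name is only illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// map() is called once per COMPLETE line handed over by TextInputFormat,
// so a token such as "Mus" or "hi" is never emitted for the split word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. ("Hello", 1), ("StackOverflow", 1)
        }
    }
}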

Update:

Regarding your other query in the comments, have a look at the following passage from Hadoop: The Definitive Guide:

The logical records that FileInputFormats define usually do not fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your program — lines are not missed or broken, for example — but it’s worth knowing about because it does mean that data-local maps (that is, maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not normally significant.

In the absence of a remote read, your HDFS block is the InputSplit on the Mapper node. If a record crosses block boundaries, a remote read fetches the rest of the record to the first Mapper node, where the majority of the record's data is present.
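
The rule behind this can be simulated with plain Java over an in-memory file: a split that does not start at byte 0 skips up to the first newline (that partial line belongs to the previous split), and every split reads past its own end to finish its last line, which is exactly where the small remote read happens. This is only a standalone sketch under those assumptions, not Hadoop's actual LineRecordReader:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSimulation {
    // Return the complete lines that "belong" to the split [start, start + length).
    static List<String> readSplit(byte[] file, int start, int length) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        int end = Math.min(start + length, file.length);
        if (start != 0) {
            // Skip the partial first line; it belongs to the previous split.
            while (pos < file.length && file[pos++] != '\n') { }
        }
        while (pos < end) {
            // The last line may run past 'end' into the next block (the remote read).
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') { pos++; }
            lines.add(new String(file, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++;  // step over the '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] file = ("Hello StackOverflow\n"
                     + "Hi StackOverflow\n"
                     + "Hola StackOverflow\n"
                     + "Mushi Mushi StackOverflow\n").getBytes(StandardCharsets.UTF_8);
        int splitSize = 64;  // pretend splits are 64 bytes instead of 64 MB
        for (int off = 0; off < file.length; off += splitSize) {
            System.out.println("Split @" + off + " -> " + readSplit(file, off, splitSize));
        }
    }
}

With a pretend split size of 64 bytes, the first split prints all four lines, including the whole "Mushi Mushi StackOverflow", and the second split prints nothing, which mirrors the 64 MB scenario in the question.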

Ravindra babu
  • To some extent I understood that an input split is a logical representation and that its default size is the block size, i.e. 64 MB. But how will it accommodate data from other blocks (data that went into other blocks) if it (input split 1) has already filled its 64 MB? Will it increase the size of the input split automatically? – user3676578 Feb 09 '17 at 10:23
  • The InputSplit size will be increased and the data will be loaded into the RAM of the Mapper node. – Ravindra babu Feb 09 '17 at 10:59
  • But the Mapper node will not know anything about the other DataNode, so how will it get the data held in the RAM of the other Mapper node? Sorry, but I am unable to form a picture of this in my mind. – user3676578 Feb 09 '17 at 11:14
  • Refer to the question linked in the answer: http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries/34737075#34737075 : "In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record." – Ravindra babu Feb 09 '17 at 11:51
  • How will the remote reader/input split distinguish the word, given that on the hard disk everything is stored at the bit level (0s and 1s)? – user3676578 Feb 10 '17 at 05:32