
I am new to Hadoop and just installed Oracle's VirtualBox and Hortonworks' Sandbox. I then downloaded the latest version of Hadoop and imported the jar files into my Java program. I copied a sample WordCount program and created a new jar file. I ran this jar file as a job using the Sandbox. The WordCount works perfectly fine as expected. However, on my job status page, I see that the number of mappers for my input file is determined as 28. My input file contains the following line.

Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.

How is the total number of mappers determined as 28?

I added the line below to my wordcount.java program to check.

FileInputFormat.setMaxInputSplitSize(job, 2);

Also, I would like to know whether the input file can contain only 2 rows. For example, suppose I have an input file like the one below.

row1,row2,row3,row4,row5,row6.......row20

Should I split the input file into 20 different files each having only 2 rows?

Ramesh

2 Answers

HDFS blocks and MapReduce splits are two different things. Blocks are a physical division of the data, while a split is just a logical division made during an MR job. It is the duty of the InputFormat to create the splits from a given set of data, and the number of mappers is decided based on the number of splits. When you use setMaxInputSplitSize, you override this behaviour and give a split size of your own. But giving a very small value to setMaxInputSplitSize would be overkill, as there will be a lot of very small splits and you'll end up having a lot of unnecessary map tasks.

Actually, I don't see any need for you to use FileInputFormat.setMaxInputSplitSize(job, 2); in your WC program. Also, it looks like you have mistaken the 2 here. It is not the number of lines in a file. It is the maximum split size, a long value in bytes, which you would like to have for your MR job. You can have any number of lines in the file which you are going to use as your MR input.
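
For reference, here is a minimal driver sketch showing where those split-size knobs sit (class names such as WordCountDriver, WordCountMapper and WordCountReducer are placeholders, and the deprecated Job constructor is used only to keep the sketch short). With FileInputFormat the effective split size works out to roughly max(minSplitSize, min(maxSplitSize, blockSize)), so both values are byte counts, not line counts.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);

        // The InputFormat decides the logical splits, and each split gets one map task.
        job.setInputFormatClass(TextInputFormat.class);

        // These bound the split SIZE IN BYTES; they do not mean "lines per mapper".
        FileInputFormat.setMinInputSplitSize(job, 1);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // e.g. 64 MB

        job.setMapperClass(WordCountMapper.class);     // placeholder mapper class
        job.setReducerClass(WordCountReducer.class);   // placeholder reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Incidentally, if your input file is around 56 bytes, a maximum split size of 2 would give roughly 56 / 2 = 28 splits, which would explain the 28 mappers you saw.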

Does this sound OK?

Tariq

That means your input file is split into roughly 28 parts (blocks) in HDFS, since you said 28 map tasks were scheduled. That may not mean 28 parallel map tasks, though; parallelism will depend on the number of slots you have in your cluster. I'm talking in terms of Apache Hadoop; I don't know if Hortonworks made any modification to this.

Hadoop likes to work with large files, so do you really want to split your input file into 20 different files?
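
If you want to see the split count for yourself, here is a rough sketch (the SplitCount class name is mine, and it assumes the new-API classes and a readable input path) that prints how many logical splits, and therefore map tasks, TextInputFormat would create:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split-count");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Each logical split normally becomes one map task.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Splits (and therefore map tasks): " + splits.size());
    }
}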

SSaikia_JtheRocker
  • I have understood the basic concepts of Hadoop, i.e. the file will be processed in <key, value> pairs. What if my file has 20 rows? How will the mapping take place? Will it be like … and then … and so on? If so, should I feed the file with only 2 rows? If I have 20 rows in my file, how can I implement the … for mapping? – Ramesh Jun 19 '13 at 18:56
  • It'll depend on the InputFormat you choose. By default, using TextInputFormat, you will have the byte offset as the key and a row as the value in map(). It would be something like the following in different mappers running over your splits: … … (see the mapper sketch after this thread). – SSaikia_JtheRocker Jun 19 '13 at 19:09
  • 1
    And, I think, a mere 20 row file need not be split into several files. – SSaikia_JtheRocker Jun 19 '13 at 19:16
  • I have weather data which has many details like station_name, year, temp, etc. (there are actually 20 such rows). If I want to find the average temp for a particular station, my output would be …. To achieve this, how should I map? I am not able to figure this out. – Ramesh Jun 19 '13 at 19:18
  • So, you are saying your data analysis will make sense if you have all the 20 rows considered together? You have several important fields such as station name, avg_tmp, year, blah1, blah2 separated across different rows in the input file? – SSaikia_JtheRocker Jun 19 '13 at 19:20
  • Yeah, something of that sort. I can map like …, … for the second map, … for the third map, and finally my output can be produced like …, which I believe can be achieved in the reduction phase. Please correct me if I am wrong. – Ramesh Jun 19 '13 at 19:22
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/32036/discussion-between-jtherocker-and-ramesh) – SSaikia_JtheRocker Jun 19 '13 at 19:23
  • @JtheRocker: IMHO, you cannot always guarantee that a file is split into 28 HDFS blocks if 28 mappers have run. It may overlap, but it's not always true. It is the number of InputSplits which decides the number of mappers. – Tariq Jun 19 '13 at 21:15
  • @Tariq, you are correct, we can have different InputFormats that can logically create InputSplits, but I think the OP is a beginner and is using TextInputFormat to do the usual WordCount. So, when his single input file is generating around 28 mappers, the usual guess for the number of blocks the input file is consuming would roughly be 28. I know that it might not be exactly 28, and there can be a compromise from Hadoop's end to maintain proper record boundaries for each InputSplit. – SSaikia_JtheRocker Jun 20 '13 at 12:19
  • BTW, I already gave a +1 to your answer for that clarification. – SSaikia_JtheRocker Jun 20 '13 at 12:22
  • It was just to prevent him from any misconception. Thank you for the +1 :) – Tariq Jun 20 '13 at 12:26
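
For completeness, here is a minimal mapper sketch matching what the comments above describe, assuming the default TextInputFormat: map() is called once per line, with the line's byte offset as the key and the line itself as the value (WordCountMapper is a placeholder name, and the tokenising logic is just the usual WordCount one).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // offset = position of this line in the file, line = the row's content.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for the reducer to sum
        }
    }
}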