
I have to perform certain operations on my input data and write the result to HDFS using a MapReduce program. My input data looks like this:

abc  
some data  
some data  
some data  
def  
other data  
other data  
other data 

and continues in the same way, where abc and def are headers and the lines below them are tab-separated records.

My task is to eliminate the headers and append each one to the records below it, like this:

some data abc  
some data abc  
some data abc  
other data def  
other data def  
other data def  

Each header will have 50 records.

I am using the default record reader, so it reads one line at a time.

Now my problem is: how do I know that the map function has been called for the nth time? Is there a counter I can use, so that I can append the header with something like:

if (counter % 50 == 0)
    *some code*

Or are static variables the only way?


1 Answer


You can use a member variable to keep the count of how many records have been processed so far. Member variables are instance variables and are not reset each time the map function is called. You can initialize them in the mapper's setup method.

Obviously, you can also use a static variable to keep the counter.
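
For example, here is a minimal sketch of the member-variable approach (the class name is hypothetical, and it assumes each group is exactly 51 lines, one header followed by its 50 records, all read by the same mapper):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HeaderAppendMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        // Instance variables keep their values across map() calls
        // within the same mapper instance.
        private long counter;
        private String currentHeader;

        @Override
        protected void setup(Context context) {
            counter = 0;
            currentHeader = "";
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (counter % 51 == 0) {
                // Every 51st line is a header: remember it, emit nothing.
                currentHeader = value.toString().trim();
            } else {
                // Record line: append the remembered header and emit.
                context.write(new Text(value.toString() + "\t" + currentHeader),
                        NullWritable.get());
            }
            counter++;
        }
    }

Keep in mind that this only works when the whole group is read by the same mapper, which leads to the next point.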

However, the data in HDFS is stored in blocks. How are you going to handle the case where a group of records is split across two blocks?

To handle data split across two blocks, you might need a Reducer. The defining property of reducers is that all the values for a particular key are always sent to the same (single) reducer. The input to the reducer is a key and a list of values, which in your case is the list of records for one header, so you can store them very easily as per your requirement.
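
A minimal sketch of such a reducer, assuming the mapper emits the header as the key and each record line as the value (the class name is hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class HeaderAppendReducer extends Reducer<Text, Text, Text, NullWritable> {

        @Override
        protected void reduce(Text header, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            // All records for one header arrive together in a single call.
            for (Text record : records) {
                context.write(new Text(record.toString() + "\t" + header.toString()),
                        NullWritable.get());
            }
        }
    }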

Optimization: you can use the same Reducer code as a Combiner to cut down the data transferred between the map and reduce phases. Note that this only works when the Reducer's output key/value types match the map output types.

Idea: the Mapper emits the key and value as they are. By the time the Reducer receives the data as Key, List<value>, all of the values for one key have already been grouped by the MapReduce framework. You just need to emit them again, and this is the output you are looking for.
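
A sketch of such a mapper, again assuming the header can be recognized by its position within a 51-line group (the class name is hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GroupByHeaderMapper extends Mapper<LongWritable, Text, Text, Text> {

        private long counter;
        private Text currentHeader = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (counter % 51 == 0) {
                // Header line: remember it as the current key.
                currentHeader.set(value.toString().trim());
            } else {
                // Emit (header, record); the framework groups records by header.
                context.write(currentHeader, value);
            }
            counter++;
        }
    }

Note that recognizing the header still depends on the mapper seeing whole groups; as discussed in the comments, a group that straddles an input split may need a custom record reader.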

  • I am new to Hadoop and Java, so I am not sure how this is handled. Doesn't the framework take care of data split across different blocks? I found this relevant: http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries – Abhinay Sep 24 '15 at 10:56
  • I hope you got the member variable part of the answer. By data split I meant that you have 50 data values belonging to one key; this will not be taken care of by MapReduce. The record reader only handles the case where parts of a single record are split across two blocks. – YoungHobbit Sep 24 '15 at 11:09
  • Now I understand your question; I will try to figure it out. Please share if you have any suggestions on that. Thanks... – Abhinay Sep 24 '15 at 11:12
  • You can use the reducer for merging all the data (values) related to a particular key. – YoungHobbit Sep 24 '15 at 12:14
  • @Abhinay I have added some more information. Please check. – YoungHobbit Sep 24 '15 at 13:02
  • Thanks YoungHobbit! I will try to implement the same. One question in my mind: how do I take 50 lines of records as values for one key (the header in my case)? Usually one line is passed as (key, value), right? Should I create a customized record reader or something? – Abhinay Sep 25 '15 at 08:44