
Dear Hadoopers: I'm new to Hadoop and have recently been trying to implement an algorithm.

The algorithm needs to calculate a matrix that represents the rating difference for every pair of songs. I have already done this; the output is a 600000*600000 sparse matrix stored in HDFS. Let's call this dataset A (size = 160 GB).

Now I need to read the users' profiles to predict their ratings for a specific song. So I first need to read the users' profiles (about 5 GB); let's call this dataset B. Then I do the calculation using dataset A.

But I don't know how to read the two datasets from a single Hadoop program. Or can I read dataset B into RAM and then do the calculation? (I guess I can't, because HDFS is a distributed file system and I can't read all of dataset B into a single machine's memory.)

Any suggestions?

Ke Xie
  • This might help http://stackoverflow.com/questions/4593243/hadoop-job-taking-input-files-from-multiple-directories – QuinnG Apr 11 '11 at 10:14
  • I would advise you to use either Pig or Hive (google for them). Then implement this as a join from user profiles to song data. I'd also look into the Mahout Hadoop machine learning system. Implementing joins in Hadoop via its native Java API is really annoying. – Spike Gronim Apr 11 '11 at 22:39
  • Thanks Spike... Mahout does provide an implementation for pre-computing the diff matrix for Slope One, but it doesn't offer a complete Hadoop version of the Slope One algorithm. I'll try Hive anyway. Thank you for your suggestion. – Ke Xie Apr 16 '11 at 11:57

2 Answers


You can use two map functions; each map function can process one dataset if you want to implement different processing. You need to register each mapper with your job conf. For example:

    // Imports needed at the top of FullOuterJoin.java:
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;
    import org.apache.hadoop.util.GenericOptionsParser;

    // Mapper for the first dataset. It emits (book_title, "person_book#" + person_name),
    // tagging each value so the reducer can tell which dataset a record came from.
    public static class FullOuterJoinStdDetMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private String person_name, book_title, file_tag = "person_book#";

        public void map(LongWritable key, Text values,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String line = values.toString();
            try {
                String[] person_detail = line.split(",");
                person_name = person_detail[0].trim();
                book_title = person_detail[1].trim();
            } catch (ArrayIndexOutOfBoundsException e) {
                person_name = "student name missing";
            }
            // The join key is the book title; the value carries the dataset tag.
            String emit_value = file_tag + person_name;
            output.collect(new Text(book_title), new Text(emit_value));
        }
    }


    // Mapper for the second dataset. It emits (book_title, "auth_book#" + author_name),
    // using a different tag than the first mapper.
    public static class FullOuterJoinResultDetMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private String author_name, book_title, file_tag = "auth_book#";

        public void map(LongWritable key, Text values,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String line = values.toString();
            try {
                String[] author_detail = line.split(",");
                author_name = author_detail[1].trim();
                book_title = author_detail[0].trim();
            } catch (ArrayIndexOutOfBoundsException e) {
                author_name = "Not Appeared in Exam";
            }
            String emit_value = file_tag + author_name;
            output.collect(new Text(book_title), new Text(emit_value));
        }
    }


    public static void main(String args[]) throws Exception {

        if (args.length != 3) {
            System.out.println("Input/output paths missing");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        String[] argum = new GenericOptionsParser(conf, args).getRemainingArgs();
        conf.set("mapred.textoutputformat.separator", ",");

        // Build the JobConf from the Configuration so the separator setting is picked up.
        JobConf mrjob = new JobConf(conf);
        mrjob.setJobName("Inner_Join");
        mrjob.setJarByClass(FullOuterJoin.class);

        // Register a different mapper for each input path.
        MultipleInputs.addInputPath(mrjob, new Path(argum[0]), TextInputFormat.class, FullOuterJoinStdDetMapper.class);
        MultipleInputs.addInputPath(mrjob, new Path(argum[1]), TextInputFormat.class, FullOuterJoinResultDetMapper.class);

        FileOutputFormat.setOutputPath(mrjob, new Path(argum[2]));
        mrjob.setReducerClass(FullOuterJoinReducer.class);

        mrjob.setOutputKeyClass(Text.class);
        mrjob.setOutputValueClass(Text.class);

        JobClient.runJob(mrjob);
    }
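
The FullOuterJoinReducer referenced above is not shown in the answer. A minimal sketch of what such a reducer might look like, assuming it simply strips the "person_book#"/"auth_book#" tags and emits every pairing of values from the two datasets for a given key (names and placeholder "NONE" handling are illustrative, not from the original code):

    // Requires: import java.util.*; and import org.apache.hadoop.mapred.Reducer;
    // Hypothetical reducer, not part of the original answer: for each key it
    // separates values by their dataset tag and emits the cross product,
    // falling back to a placeholder when one side is missing (full outer join).
    public static class FullOuterJoinReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            List<String> persons = new ArrayList<String>();
            List<String> authors = new ArrayList<String>();
            while (values.hasNext()) {
                String value = values.next().toString();
                if (value.startsWith("person_book#")) {
                    persons.add(value.substring("person_book#".length()));
                } else if (value.startsWith("auth_book#")) {
                    authors.add(value.substring("auth_book#".length()));
                }
            }
            if (persons.isEmpty()) persons.add("NONE");
            if (authors.isEmpty()) authors.add("NONE");
            for (String p : persons) {
                for (String a : authors) {
                    output.collect(key, new Text(p + "," + a));
                }
            }
        }
    }
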
Tanveer

Hadoop allows you to use a different map input format for each input folder, so you can read from several data sources and then cast to the specific type in the map function, i.e. in one case you get (String, User) and in the other (String, SongSongRating), and your map signature is (String, Object). The second step is to select a recommendation algorithm and join the data in such a way that the aggregator has at least enough information to calculate a recommendation.
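
A minimal sketch of that casting idea, assuming hypothetical User and SongSongRating classes (both illustrative; they are not defined anywhere in this thread):

    // Illustrative only: User and SongSongRating are placeholder types standing
    // in for the two datasets; neither is defined in the answer above.
    class User { String userId; }
    class SongSongRating { String songA, songB; double ratingDiff; }

    class RecommendationMapperSketch {
        // Both datasets arrive through the same (String, Object) map signature
        // and are dispatched with a runtime type check before processing.
        void map(String key, Object value) {
            if (value instanceof User) {
                User u = (User) value;
                // handle a user-profile record (dataset B)
            } else if (value instanceof SongSongRating) {
                SongSongRating r = (SongSongRating) value;
                // handle a song-pair rating record (dataset A)
            }
        }
    }
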

yura