
I have been trying to join fields from two data sets but have been unsuccessful. I would appreciate it if someone could help me achieve this. The files and the code I have been trying are as follows.

movie-metadata

975900  /m/03vyhn   Ghosts of Mars  2001-08-24  14010832    98.0    {"/m/02h40lc": "English Language"}  {"/m/09c7w0": "United States of America"}   {"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}
3196793 /m/08yl5d   Getting Away with Murder: The JonBenét Ramsey Mystery   2000-02-16      95.0    {"/m/02h40lc": "English Language"}  {"/m/09c7w0": "United States of America"}   {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biographical film", "/m/07s9rl0": "Drama", "/m/0hj3n01": "Crime Drama"}
28463795    /m/0crgdbh  Brun bitter 1988        83.0    {"/m/05f_3": "Norwegian Language"}  {"/m/05b4w": "Norway"}  {"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "Drama"}
9363483 /m/0285_cd  White Of The Eye    1987        110.0   {"/m/02h40lc": "English Language"}  {"/m/07ssc": "United Kingdom"}  {"/m/01jfsb": "Thriller", "/m/0glj9q": "Erotic thriller", "/m/09blyk": "Psychological thriller"}

character-metadata

975900  /m/03vyhn   2001-08-24  Akooshay    1958-08-26  F   1.62        Wanda De Jesus  42  /m/0bgchxw  /m/0bgcj3x  /m/03wcfv7
975900  /m/03vyhn   2001-08-24  Lieutenant Melanie Ballard  1974-08-15  F   1.78    /m/044038p  Natasha Henstridge  27  /m/0jys3m   /m/0bgchn4  /m/0346l4
975900  /m/03vyhn   2001-08-24  Desolation Williams 1969-06-15  M   1.727   /m/0x67 Ice Cube    32  /m/0jys3g   /m/0bgchn_  /m/01vw26l
975900  /m/03vyhn   2001-08-24  Sgt Jericho Butler  1967-09-12  M   1.75        Jason Statham   33  /m/02vchl6  /m/0bgchnq  /m/034hyc

In the first file, I am interested in the first field, which is the movie ID, and the third field, the movie name. In the second file, the first field is the movie ID and the 9th field is the actor name. There can be multiple actor names for each movie ID, as shown in file 2 above. The output I am trying to achieve is in the following format:

movieId     movieName, actorName1, actorName2, actorName3....etc.

I have been successful in extracting the fields in the two mapper classes. In the reducer class, however, my code does not produce the format above which I intend as output. Instead, I get:

movieId movieName, actorName1

I do not get the rest of the actor names. Please have a look at my code and correct me accordingly.

public class Join {
    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: Join <input path> <output path>");
            System.exit(-1);
        }

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJobName("Join");

    job.setJarByClass(Join.class);
    job.setReducerClass(JoinReduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    MultipleInputs.addInputPath(job, new Path(args[0]),
            TextInputFormat.class, JoinMap1.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
            TextInputFormat.class, JoinMap2.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

public static class JoinMap1 extends
        Mapper<LongWritable, Text, Text, Text> {
    private String movieId, movieName, fileTag = "A~ ";

    @Override
    public void map(LongWritable key, Text value,Context context) 
            throws IOException, InterruptedException {
        String values[] = value.toString().split("\t");
        movieId = values[0].trim();
        movieName = values[2].trim().replaceAll("\t", "movie Name");
        context.write(new Text(movieId), new Text (fileTag + movieName));
    }

}

public static class JoinMap2 extends Mapper<LongWritable, Text, Text, Text>{
    private String movieId, actorName, fileTag = "B~ ";
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String values[] = line.toString().split("\t");
        movieId = values[0].trim();
        actorName = values[8].trim().replaceAll("\t", "actor Name");
        context.write(new Text (movieId), new Text (fileTag + actorName));
    }
}

public static class JoinReduce extends
        Reducer<Text, Text, Text, Text> {
     private String movieName, actorName;
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) 
            throws IOException, InterruptedException 
    { 
        for (Text value : values){
            String currValue = value.toString();
            String splitVals[] = currValue.split("~");
            if(splitVals[0].equals("A")){
                movieName = splitVals[1] != null ? splitVals[1].trim() : "movieName";
            } else if (splitVals[0].equals("B")){
                actorName= splitVals[1] != null ? splitVals[1].trim() : "actorName";
            }  
        }
        context.write(key, new Text (movieName + ", " + actorName));
    }
}
}

Please suggest what can be done so that I can achieve the output shown above. Any help would be greatly appreciated. Brickbats are welcome.

Uzair Syed

2 Answers


Even though your code iterates through all the values, it doesn't accumulate actor names; rather, it keeps overwriting the current actor name with each new one.
Instead of this:

actorName= splitVals[1] != null ? splitVals[1].trim() : "actorName";

Try this:

actorName += (splitVals[1] != null ? splitVals[1].trim() : "actorName") + ", ";
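To make the accumulation concrete, here is a minimal sketch of the idea in plain Java, with the Hadoop classes omitted. `joinValues` is a hypothetical helper standing in for the body of `reduce()`; note it uses local variables rather than reducer instance fields, so nothing carries over from one key to the next.

```java
import java.util.Arrays;
import java.util.List;

public class JoinReduceSketch {
    // Stand-in for reduce(): merges the tagged values ("A~ movieName",
    // "B~ actorName") that arrive for ONE movieId key. Locals (not
    // instance fields) guarantee no state leaks between keys.
    static String joinValues(Iterable<String> values) {
        String movieName = "";
        StringBuilder actors = new StringBuilder();
        for (String value : values) {
            String[] splitVals = value.split("~", 2);
            if (splitVals[0].equals("A")) {
                movieName = splitVals[1].trim();
            } else if (splitVals[0].equals("B")) {
                if (actors.length() > 0) actors.append(", ");
                actors.append(splitVals[1].trim()); // append, don't overwrite
            }
        }
        return movieName + ", " + actors;
    }

    public static void main(String[] args) {
        // Values may arrive in any order; the "A" record need not come first.
        List<String> values = Arrays.asList(
                "B~ Wanda De Jesus",
                "A~ Ghosts of Mars",
                "B~ Natasha Henstridge");
        System.out.println(joinValues(values));
        // Ghosts of Mars, Wanda De Jesus, Natasha Henstridge
    }
}
```

Inside the real reducer you would then call `context.write(key, new Text(joinValues(...)))` once, after the loop over all values has finished.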
Mr.Chowdary
Gwen Shapira
  • Hi Gwen, I made the changes to the code as suggested, but the MapReduce job fails with the error: File /user/uzair/output/joinout/_temporary/_attempt_201411290045_0001_r_000000_1/part-r-00000 could only be replicated to 0 nodes, instead of 1. The job runs successfully if no changes are made to the code. While the job was running (with the suggested changes), I checked _temporary/_attempt_201411290045_0001_r_000000_1/part-r-00000, where each record has the movieId and movieName, but all the actor names from all the records, irrespective of the key, are being appended. Please advise. – Uzair Syed Nov 28 '14 at 19:43

Hi~ I just read through your code. I have the same suggestion as Gwen. If you want your result records to contain "Movie ID" + "Movie name" + "Actors", you must put all the output values into context.write() at the same time. So what Gwen suggested is what you have to do.

I think the job failure is not a MapReduce problem, but an HDFS one. Check out "Hadoop: File … could only be replicated to 0 nodes, instead of 1".

One thing I'm curious about is the JoinMap2 part.

String values[] = line.toString().split("\t");
movieId = values[0].trim();
actorName = values[8].trim().replaceAll("\t", "actor Name");

You split the line on "\t", which means there can be no "\t" left inside any cell of values[].

So what do you really want the 3rd line to do? Replace "\t" with "actor Name"? There is no "\t" in values[8].

You need to do at least three things to complete your MapReduce job.

  1. Fix your HDFS.
  2. Rewrite JoinMap2 to make sure it outputs what you want: the actors.
  3. Rewrite reducer, just as Gwen said.
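For point 2, a sketch of the extraction logic in plain Java may help (the Hadoop mapper plumbing is omitted; `extract` is a hypothetical helper standing in for the body of `map()`). The `replaceAll("\t", ...)` call is simply dropped, since splitting on "\t" already guarantees no cell contains a tab.

```java
public class JoinMap2Sketch {
    // Stand-in for JoinMap2.map(): pulls the movie ID (1st field) and
    // actor name (9th field) out of one tab-separated character record,
    // returning {key, taggedValue}.
    static String[] extract(String line) {
        String[] values = line.split("\t");
        String movieId = values[0].trim();
        String actorName = values[8].trim(); // 9th field: actor name
        return new String[] { movieId, "B~ " + actorName };
    }

    public static void main(String[] args) {
        String line = "975900\t/m/03vyhn\t2001-08-24\tDesolation Williams"
                + "\t1969-06-15\tM\t1.727\t/m/0x67\tIce Cube\t32"
                + "\t/m/0jys3g\t/m/0bgchn_\t/m/01vw26l";
        String[] kv = extract(line);
        System.out.println(kv[0] + " -> " + kv[1]);
        // 975900 -> B~ Ice Cube
    }
}
```

The real mapper would then emit this pair with `context.write(new Text(kv[0]), new Text(kv[1]))`.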
Eefy
  • Hi. Thanks for your response. I had put the third line in the code you just quoted because I was testing something else; I have removed it (the replaceAll). I get the HDFS problem only when I change the reducer code as Gwen suggested. There is no problem with HDFS. What is actually happening is that values for the same key are not being treated as belonging to the same key. For example, key movieId 0001 with the value movieName "SomeMovie" and the same key movieId 0001 with the value actorName "SomeActor" are being treated as different keys. I am sorry if I am unable to express this clearly. I can update the code and the output I get if you don't understand what I just said. – Uzair Syed Nov 30 '14 at 09:03