I am working on an NLP project that builds entity sets and computes pairwise similarities over large-scale corpora. Currently I am using Hadoop Streaming and have implemented all of the mappers and reducers in Python. Since the algorithm needs several rounds of MapReduce, I use shell scripts to chain the jobs.
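For context, every step follows the usual Hadoop Streaming contract: read raw text from stdin, write tab-separated key/value lines to stdout. The scripts below are not my actual code, just a minimal word-count-style sketch of the shape each step takes:

```python
#!/usr/bin/env python3
# mapper.py -- minimal sketch of one streaming step:
# read raw text from stdin, emit tab-separated key/value lines on stdout.
import sys

for line in sys.stdin:
    for token in line.split():
        print(f"{token}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- minimal sketch: Hadoop Streaming sorts by key before this runs,
# so consecutive lines with the same key can be aggregated with groupby.
import sys
from itertools import groupby

def pairs(stream):
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

for key, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    print(f"{key}\t{sum(int(v) for _, v in group)}")
```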
Now here are my concerns and what I want to do next:
[Concern 1]. Job chaining and job control. Chaining Hadoop Streaming jobs is problematic. If job2 in the sequence (job1-job2-job3) fails, I have to manually delete the output folder, adjust the script that launches the jobs, and re-run the sequence from the middle. I really hope to find a smarter way to do this. Since I frequently adjust the parameters and the logic of the algorithms, I don't want to repeat those steps again and again.
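To illustrate what "smarter" might mean here, something like the sketch below is roughly what I have in mind: a small Python driver that skips steps whose output already exists and stops at the first failure, so a broken chain can be re-run without editing the script. The streaming JAR path and the step definitions are placeholders, not my real configuration:

```python
#!/usr/bin/env python3
# driver.py -- sketch of a resumable chain of Hadoop Streaming jobs.
# A step is skipped if its output directory already exists (it finished in an
# earlier run), so re-running after a failure resumes from the broken step.
import subprocess
import sys

STREAMING_JAR = "/path/to/hadoop-streaming.jar"  # placeholder

STEPS = [  # placeholder step definitions
    {"mapper": "mapper1.py", "reducer": "reducer1.py",
     "input": "corpus/raw", "output": "stage1"},
    {"mapper": "mapper2.py", "reducer": "reducer2.py",
     "input": "stage1", "output": "stage2"},
]

def hdfs_exists(path):
    # `hadoop fs -test -e` returns 0 if the path exists
    return subprocess.call(["hadoop", "fs", "-test", "-e", path]) == 0

def run_step(step):
    cmd = ["hadoop", "jar", STREAMING_JAR,
           "-files", f"{step['mapper']},{step['reducer']}",
           "-mapper", step["mapper"],
           "-reducer", step["reducer"],
           "-input", step["input"],
           "-output", step["output"]]
    return subprocess.call(cmd)

for step in STEPS:
    if hdfs_exists(step["output"]):
        print(f"skipping {step['output']} (already done)")
        continue
    if run_step(step) != 0:
        # remove the partial output so the next run retries this step cleanly
        subprocess.call(["hadoop", "fs", "-rm", "-r", step["output"]])
        sys.exit(f"step producing {step['output']} failed")
```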
[Concern 2]. Speed and efficiency. I suspect a large share of the running time goes into parsing text into numbers and similar format conversions, which are, in principle, unnecessary. It also takes a lot of time to write and test that near-duplicate Python code during development.
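To make the concern concrete: because every intermediate result is plain text, each step has to parse strings back into numbers and then immediately re-serialize them. The field layout below is hypothetical, but this round-trip pattern recurs, with small variations, across many of the scripts:

```python
# sketch of the text<->number round-trip that every streaming step pays for;
# the "id:weight,id:weight,..." layout is hypothetical.
import sys

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    # parse the value field back into numbers ...
    features = [(int(i), float(w))
                for i, w in (pair.split(":") for pair in value.split(","))]

    # ... do the actual computation on the numeric data ...
    features = [(i, w * 2.0) for i, w in features]  # stand-in for real work

    # ... and immediately serialize everything back to text for the next job.
    print(key + "\t" + ",".join(f"{i}:{w}" for i, w in features))
```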
[Concern 3]. Ease of maintenance and distribution. As the project grows (I already have more than 20 MapReduce jobs), it really needs to be modularized. I want to make it object-oriented and use an IDE to develop and maintain it, so that handling the various internal data structures and formats is more comfortable. I also want to distribute the project as a package, so that other people may benefit from it. So I need an easy way to import the whole project into an IDE and an easy way to distribute it.
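For the distribution side, I expect a standard setuptools layout would be enough; the sketch below uses a placeholder package name and metadata, just to show roughly what I mean by "distribute as a package":

```python
# setup.py -- minimal setuptools sketch; "entitysim" and the metadata
# are placeholder names, not the project's real ones.
from setuptools import setup, find_packages

setup(
    name="entitysim",
    version="0.1.0",
    description="Entity set construction and pairwise similarity on Hadoop",
    packages=find_packages(),          # picks up the package and its subpackages
    scripts=["bin/run_pipeline.py"],   # hypothetical driver script
)
```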
I've done some research on possible solutions:
Alternative 1. Hadoop custom JARs: It seems that the most thorough option is to convert the entire project into Java as a custom Hadoop JAR. This could fix all the problems, including job chaining, efficiency, and maintenance. But it would take quite a lot of time, and I have not found a way to debug it efficiently.
Alternative 2. Pig: I found the answer to this question quite helpful in figuring out when (not) to use Pig. In the answer, Arun_suresh says that if "you have some very specific computation you need to do within your Map/reduce functions … then you should consider deploying your own jars". My jobs include shingling, hashing, min-hashing, permutations, etc. Can these be implemented in Pig Latin? Is there somewhere I can get an idea of how complex the computations in Pig Latin programs can be?
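To give a sense of the "very specific computation" involved, the shingling plus min-hashing step alone looks roughly like the plain-Python sketch below (shingle size and number of hash functions are arbitrary here). My question is essentially whether logic like this can be expressed directly in Pig Latin, or whether it inevitably ends up in custom UDFs:

```python
# sketch of shingling + min-hashing for one document; parameters are arbitrary.
import random

NUM_HASHES = 100
PRIME = 2147483647  # large prime for the universal hash family

random.seed(42)
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
               for _ in range(NUM_HASHES)]

def shingles(text, k=3):
    """All contiguous k-word shingles of the text, hashed to integers.
    (Python's built-in hash() is process-dependent; a real implementation
    would use a stable hash so signatures are comparable across runs.)"""
    words = text.split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set):
    """For each hash function h_i(x) = (a*x + b) mod PRIME, keep the minimum."""
    return [min((a * s + b) % PRIME for s in shingle_set)
            for (a, b) in HASH_PARAMS]

sig = minhash_signature(shingles("a toy document used only for illustration"))
```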
Alternative 3. Mahout: I found that the newly released Apache Mahout versions have several functions that overlap with what I am doing, but they cannot replace my work. Should I base my project on Mahout?
Since I am basically on my own for this job, and have only about two weeks for the housekeeping work and about one month to improve it, I really need to find an efficient and reliable way to do this. Please help me choose one of the alternatives, or tell me if you have a better solution.