I am working on an NLP project that builds entity sets and computes pairwise similarities over large-scale corpora. Currently I am using Hadoop Streaming and have implemented all of the mappers and reducers in Python. Since the algorithm needs several rounds of MapReduce, I use shell scripts to chain the jobs.
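For context, every step follows the usual Hadoop Streaming contract: read raw text from stdin, write tab-separated key/value lines to stdout. The scripts below are not my actual code, just a minimal word-count-style sketch of the shape each step takes:

```python
#!/usr/bin/env python3
# mapper.py -- minimal sketch of one streaming step:
# read raw text from stdin, emit tab-separated key/value lines on stdout.
import sys

for line in sys.stdin:
    for token in line.split():
        print(f"{token}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- minimal sketch: Hadoop Streaming sorts by key before this runs,
# so consecutive lines with the same key can be aggregated with groupby.
import sys
from itertools import groupby

def pairs(stream):
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

for key, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    print(f"{key}\t{sum(int(v) for _, v in group)}")
```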
Now here are my concerns and what I want to do next:
[Concern 1]. Job chaining and job control. Chaining Hadoop Streaming jobs is problematic. If job2 in the sequence (job1-job2-job3) fails, I have to manually delete the output folder, adjust the script that launches the jobs, and re-run the sequence from the middle. I really hope to find a smarter way to do this. Since I frequently adjust the parameters and the logic of the algorithms, I don't want to repeat those steps again and again.
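To illustrate what "smarter" might mean here, something like the sketch below is roughly what I have in mind: a small Python driver that skips steps whose output already exists and stops at the first failure, so a broken chain can be re-run without editing the script. The streaming JAR path and the step definitions are placeholders, not my real configuration:

```python
#!/usr/bin/env python3
# driver.py -- sketch of a resumable chain of Hadoop Streaming jobs.
# A step is skipped if its output directory already exists (it finished in an
# earlier run), so re-running after a failure resumes from the broken step.
import subprocess
import sys

STREAMING_JAR = "/path/to/hadoop-streaming.jar"  # placeholder

STEPS = [  # placeholder step definitions
    {"mapper": "mapper1.py", "reducer": "reducer1.py",
     "input": "corpus/raw", "output": "stage1"},
    {"mapper": "mapper2.py", "reducer": "reducer2.py",
     "input": "stage1", "output": "stage2"},
]

def hdfs_exists(path):
    # `hadoop fs -test -e` returns 0 if the path exists
    return subprocess.call(["hadoop", "fs", "-test", "-e", path]) == 0

def run_step(step):
    cmd = ["hadoop", "jar", STREAMING_JAR,
           "-files", f"{step['mapper']},{step['reducer']}",
           "-mapper", step["mapper"],
           "-reducer", step["reducer"],
           "-input", step["input"],
           "-output", step["output"]]
    return subprocess.call(cmd)

for step in STEPS:
    if hdfs_exists(step["output"]):
        print(f"skipping {step['output']} (already done)")
        continue
    if run_step(step) != 0:
        # remove the partial output so the next run retries this step cleanly
        subprocess.call(["hadoop", "fs", "-rm", "-r", step["output"]])
        sys.exit(f"step producing {step['output']} failed")
```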
[Concern 2]. Speed and efficiency. I suspect a large share of the running time goes into parsing text into numbers and similar format conversions, which are, in principle, unnecessary. It also takes a lot of time to write and test that near-duplicate Python code during development.
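To make the concern concrete: because every intermediate result is plain text, each step has to parse strings back into numbers and then immediately re-serialize them. The field layout below is hypothetical, but this round-trip pattern recurs, with small variations, across many of the scripts:

```python
# sketch of the text<->number round-trip that every streaming step pays for;
# the "id:weight,id:weight,..." layout is hypothetical.
import sys

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    # parse the value field back into numbers ...
    features = [(int(i), float(w))
                for i, w in (pair.split(":") for pair in value.split(","))]

    # ... do the actual computation on the numeric data ...
    features = [(i, w * 2.0) for i, w in features]  # stand-in for real work

    # ... and immediately serialize everything back to text for the next job.
    print(key + "\t" + ",".join(f"{i}:{w}" for i, w in features))
```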
[Concern 3]. Ease of maintenance and distribution. As the project grows (I already have more than 20 MapReduce jobs), it really needs to be modularized. I want to make it object-oriented and use an IDE to develop and maintain it, so that handling the various internal data structures and formats is more comfortable. I also want to distribute the project as a package, so that other people may benefit from it. So I need an easy way to import the whole project into an IDE and an easy way to distribute it.
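For the distribution side, I expect a standard setuptools layout would be enough; the sketch below uses a placeholder package name and metadata, just to show roughly what I mean by "distribute as a package":

```python
# setup.py -- minimal setuptools sketch; "entitysim" and the metadata
# are placeholder names, not the project's real ones.
from setuptools import setup, find_packages

setup(
    name="entitysim",
    version="0.1.0",
    description="Entity set construction and pairwise similarity on Hadoop",
    packages=find_packages(),          # picks up the package and its subpackages
    scripts=["bin/run_pipeline.py"],   # hypothetical driver script
)
```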
I've done some research on possible solutions:
Alternative 1. Hadoop custom JARs: It seems that the most thorough option is to convert the entire project into Java as a custom Hadoop JAR. This could fix all the problems, including job chaining, efficiency, and maintenance. But it would take quite a lot of time, and I have not found a way to debug it efficiently.
Alternative 2. Pig: I found the answer to this question quite helpful in figuring out when (not) to use Pig. In the answer, Arun_suresh says that if "you have some very specific computation you need to do within your Map/reduce functions … then you should consider deploying your own jars". My jobs include shingling, hashing, min-hashing, permutations, etc. Can these be implemented in Pig Latin? Is there somewhere I can get an idea of how complex the computations in Pig Latin programs can be?
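To give a sense of the "very specific computation" involved, the shingling plus min-hashing step alone looks roughly like the plain-Python sketch below (shingle size and number of hash functions are arbitrary here). My question is essentially whether logic like this can be expressed directly in Pig Latin, or whether it inevitably ends up in custom UDFs:

```python
# sketch of shingling + min-hashing for one document; parameters are arbitrary.
import random

NUM_HASHES = 100
PRIME = 2147483647  # large prime for the universal hash family

random.seed(42)
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
               for _ in range(NUM_HASHES)]

def shingles(text, k=3):
    """All contiguous k-word shingles of the text, hashed to integers.
    (Python's built-in hash() is process-dependent; a real implementation
    would use a stable hash so signatures are comparable across runs.)"""
    words = text.split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set):
    """For each hash function h_i(x) = (a*x + b) mod PRIME, keep the minimum."""
    return [min((a * s + b) % PRIME for s in shingle_set)
            for (a, b) in HASH_PARAMS]

sig = minhash_signature(shingles("a toy document used only for illustration"))
```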
Alternative 3. Mahout: I found that the newly released Apache Mahout versions have several functions that overlap with what I am doing, but they cannot replace my work. Should I base my project on Mahout?
Since I am basically on my own for this job, and have only about two weeks for the housekeeping work and about one month to improve it, I really need to find an efficient and reliable way to do this. Please help me choose one of the alternatives, or tell me if you have a better solution.