
I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know what speed gains I would see if I implemented the same mapper and reducer in Java (or used Pig).

In particular, I'm looking for people's experiences of migrating from streaming to custom jar deployments and/or Pig, as well as documents containing benchmark comparisons of these options. I found this question, but the answers are not specific enough for me. I'm not looking for comparisons between Java and Python, but for comparisons between custom jar deployment in Hadoop and Python-based streaming.

My job reads n-gram counts from the Google Books Ngram dataset and computes aggregate measures. CPU utilization on the compute nodes seems to be close to 100%. (I would also like to hear your opinions about the differences between running a CPU-bound and an IO-bound job.)

Thanks!

Ruggiero Spearman

1 Answer


Why consider deploying custom jars?

  • Ability to use more powerful custom input formats. For streaming jobs, even if you use pluggable input/output as mentioned here, the keys and values coming into your mapper/reducer are limited to text/strings, so you have to spend some CPU cycles converting them to the types you need (see the sketch after this list).
  • I've also heard that Hadoop can be smart about reusing JVMs across multiple jobs, which won't be possible when streaming (I can't confirm this).
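
For illustration, here is a minimal sketch of what the map side of a custom-jar job could look like, under the assumption of a custom InputFormat that delivers already-parsed records; the class name and field layout are hypothetical, not from the question:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for a custom-jar version of the ngram job. It assumes a
// custom InputFormat (not shown) that already parses each record into a
// (Text ngram, LongWritable matchCount) pair.
public class NgramAggregateMapper
        extends Mapper<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void map(Text ngram, LongWritable matchCount, Context context)
            throws IOException, InterruptedException {
        // The record arrives as typed Writables, so no per-record string
        // splitting happens in user code; with streaming, the same record
        // would arrive as a tab-separated line of text on stdin.
        context.write(ngram, matchCount);
    }
}
```

The equivalent streaming mapper would receive each record as a line of text on stdin and would have to split and parse it itself on every record.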

When to use Pig?

  • Pig Latin is pretty cool and is a much higher-level data flow language than Java, Python, or Perl. Your Pig scripts will tend to be much smaller than an equivalent task written in any of the other languages.

When NOT to use Pig?

  • Even though Pig is pretty good at figuring out by itself how many map and reduce tasks to use and when to spawn them (and a myriad of such things), if you are dead sure how many maps and reduces you need, you have some very specific computation to do within your map/reduce functions, and you are very particular about performance, then you should consider deploying your own jars. This link shows that Pig can lag native Hadoop M/R in performance. You could also take a look at writing your own Pig UDFs that isolate some compute-intensive function (and possibly even use JNI to call native C/C++ code within the UDF); see the sketch below.
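
As a rough illustration, a compute-intensive UDF is just a Java class extending Pig's EvalFunc; the class name and the assumed (match_count, volume_count) field layout below are made up for the example:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that isolates a compute-intensive, per-record calculation.
// The assumed (match_count, volume_count) field layout is illustrative only.
public class NormalizedCount extends EvalFunc<Double> {

    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        long matchCount = ((Number) input.get(0)).longValue();
        long volumeCount = ((Number) input.get(1)).longValue();
        if (volumeCount == 0L) {
            return null;
        }
        // Placeholder for the real number crunching; this is also where a JNI
        // call into native C/C++ code could live.
        return (double) matchCount / volumeCount;
    }
}
```

Such a UDF would be packaged into a jar, loaded with REGISTER in the Pig script, and then called like a built-in function inside a FOREACH ... GENERATE.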

A note on IO-bound and CPU-bound jobs:

  • Technically speaking, the whole point of Hadoop and MapReduce is to parallelize compute-intensive functions, so I'd presume your map and reduce jobs are compute-intensive. The only time the Hadoop subsystem is busy doing IO is in between the map and reduce phases, when data is sent across the network. IO also matters if you have a large amount of data and have manually configured too few maps and reduces, resulting in spills to disk (although too many tasks will result in too much time spent starting/stopping JVMs and in too many small files). A streaming job would also have the additional overhead of starting a Python/Perl VM and of data being copied back and forth between the JVM and the scripting VM. A minimal driver sketch showing the relevant knobs is below.
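
For reference, these task-count and JVM-reuse knobs can be set in a custom-jar driver; the sketch below uses the old (org.apache.hadoop.mapred) API, and the class name, reducer count, and paths are placeholders rather than recommendations:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver using the old (org.apache.hadoop.mapred) API; the class
// name, task count, and paths are placeholders.
public class NgramJobDriver {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NgramJobDriver.class);
        conf.setJobName("ngram-aggregate");

        // conf.setMapperClass(...) and conf.setReducerClass(...) would go here.

        // Too few reduce tasks can mean large partitions spilling to disk;
        // too many means time lost starting/stopping JVMs and many small files.
        conf.setNumReduceTasks(32);

        // Reuse task JVMs within this job (-1 = no limit on tasks per JVM).
        conf.setNumTasksToExecutePerJvm(-1);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

With streaming, the equivalent properties can be passed as -D options (e.g. -D mapred.reduce.tasks=...) when submitting the job.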
arun_suresh
  • Thanks! As I already have plain text input/output requirements, custom input formats are irrelevant to my case. The evaluation of Pig tells me that I might rather stay away from it. I already have Python implementations. My scripts are CPU-intensive. They just read from standard input, do some number crunching, and output the result. But I'm not sure if that means my Hadoop job as a whole can be regarded as CPU-bound. In any case, what I really wanted to ask was the interaction between whether a job is CPU-bound or IO-bound and whether it is implemented as a custom jar or streaming job. – Ruggiero Spearman Jul 31 '11 at 14:28
  • Considering that each of your map and reduce tasks is going to be running in its own JVM, and that map and reduce functions are generally CPU-bound, these individual Hadoop tasks will be CPU-bound. The coordinating JVM for the Hadoop job would most probably be IO-intensive, since it's busy waiting for responses from the individual tasks sending data to the map and reduce layers. – arun_suresh Jul 31 '11 at 15:05
  • Actually, I just realized that the JVMs on which the map and reduce tasks run also handle some IO (streaming the input in from HDFS and writing the output to HDFS). Since Hadoop ensures that the map function runs close to where its data is, that part is generally pretty fast (this is not true for the reduce function). – arun_suresh Jul 31 '11 at 15:09