Parallel processing of small functions in the cloud

Question

I have between a few million and a billion (10^9) input data sets that need to be processed. Each one is quite small (< 1 kB) and takes about 1 second to process.

I have read a lot about Apache Hadoop, MapReduce and StarCluster, but I am not sure which is the fastest and most efficient way to process this data.

I am thinking of using Amazon EC2 or a similar cloud service.
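
For a sense of scale, the figures above can be turned into a rough estimate of total compute time. This is only a back-of-the-envelope sketch: it assumes exactly 1 second per input, perfect parallel scaling, and no scheduling or I/O overhead, and the core counts are hypothetical examples rather than a recommendation. Under those assumptions, 10^9 inputs come to 10^9 core-seconds, or roughly 11,574 core-days.

```python
SECONDS_PER_DAY = 86_400


def wall_clock_days(num_inputs, seconds_per_input=1.0, cores=1):
    """Idealized wall-clock time in days, assuming perfect scaling across cores."""
    return num_inputs * seconds_per_input / cores / SECONDS_PER_DAY


if __name__ == "__main__":
    # Core counts below are illustrative assumptions, not figures from the question.
    for num_inputs in (10**6, 10**9):
        for cores in (1, 100, 1000):
            days = wall_clock_days(num_inputs, cores=cores)
            print(f"{num_inputs:>13,} inputs on {cores:>4} cores: {days:10.2f} days")
```

Since the work is CPU-bound and each input is independent, the total data volume (at most about 1 TB for 10^9 inputs of 1 kB) matters much less than the number of cores available.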

Hadoop and MapReduce are pretty adaptable but they are definitely better at some things. Are you willing/able to code? What languages do you know? What kind of processing do you need to do on the data? — Paul M, Jul 24 '12 at 19:47
@PaulM The language does not matter. I know Python, Java, Ruby, C and C++, so I will (hopefully) be able to learn it :) The input is a small string, and the processing is similar to computing a SHA-512 hash. It is some hash-like function, but I am not allowed to provide other details. — Mark, Jul 24 '12 at 19:52
Sounds like you're working on a rainbow table / password cracker? — BonanzaDriver, Jul 25 '12 at 14:40

score 3 · Accepted Answer · answered Jul 24 '12 at 19:52

You might consider something like Amazon EMR, which takes care of a lot of the plumbing with Hadoop. If you're just looking to code something quickly, Hadoop Streaming, Hive and Pig are all good tools for getting started with Hadoop without requiring you to know all of the ins and outs of MapReduce.

Paul M


Thanks for your reply. I have added some details in the question's comments. Can you recommend a specific approach (Streaming/Hive/Pig)? Sorry that I cannot provide more details. – Mark Jul 24 '12 at 19:55
In that case, I would try using Hadoop Streaming on Amazon EMR. Hadoop Streaming lets you write MapReduce programs like Unix pipelines, using your language of choice. The tradeoff is a performance penalty that may or may not be meaningful to you. Amazon EMR saves you the trouble of spinning up a cluster yourself, but you do have to pay for it. – Paul M Jul 24 '12 at 20:16
Thanks, I will have a deeper look at it. – Mark Jul 24 '12 at 20:59
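
The Hadoop Streaming approach suggested in the comments above could look roughly like the following mapper. It is a minimal sketch, not the asker's actual code: it assumes one input record per line with no tab characters, and `process` is a hypothetical stand-in that uses SHA-512 only because the real hash-like function was not disclosed. The names `process` and `run_mapper` are illustrative.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop Streaming mapper: reads records from stdin, writes key<TAB>value."""
import hashlib
import sys


def process(record: str) -> str:
    # Hypothetical placeholder for the real ~1 second function (not given in the question).
    return hashlib.sha512(record.encode("utf-8")).hexdigest()


def run_mapper(lines, out):
    """Emit one 'input<TAB>result' line per non-empty input line."""
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue  # skip blank lines
        # Hadoop Streaming treats the text before the first tab as the key.
        out.write(f"{record}\t{process(record)}\n")


if __name__ == "__main__":
    run_mapper(sys.stdin, sys.stdout)
```

Because a streaming mapper is just a filter over standard input and output, it can be tried locally with something like `cat input.txt | python3 mapper.py` before being passed to Hadoop Streaming as the `-mapper` program. Each record is independent here, so a map-only job with no reduce step should be enough.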
