I am running a series of MapReduce jobs on EMR. The 3rd MapReduce job needs the output of the 2nd MapReduce job, and that output is over a million key-value pairs (both the key and the value are under 1KB). Is there a good way to store this data in a distributed store on the same cluster as the EMR jobs so the subsequent job can access it? I looked at DistributedCache, but isn't that meant for distributing files? I am not sure Hadoop handles a million tiny files well.
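For reference, this is roughly how the chaining looks today: job 3 just points its input path at job 2's output directory. This is only a sketch; the paths and job name are made up, and it assumes the newer mapreduce API with job 2 writing Text/Text pairs as SequenceFiles:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class JobChain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical locations; on EMR these would be S3 or HDFS paths.
        Path job2Output = new Path("s3://my-bucket/job2-output");
        Path job3Output = new Path("s3://my-bucket/job3-output");

        Job job3 = Job.getInstance(conf, "job3");
        job3.setJarByClass(JobChain.class);
        // Job 2 wrote SequenceFiles, so job 3 reads the key-value pairs
        // back with their original types intact.
        job3.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job3, job2Output);
        SequenceFileOutputFormat.setOutputPath(job3, job3Output);
        // Assumes job 2 emitted Text keys and Text values.
        job3.setOutputKeyClass(Text.class);
        job3.setOutputValueClass(Text.class);
        System.exit(job3.waitForCompletion(true) ? 0 : 1);
    }
}
```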
Or maybe I can somehow use another MapReduce job to combine all of the key-value pairs into ONE output file, and then put that entire file into DistributedCache.
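Here is a sketch of what I mean: an identity job with a single reducer so everything lands in one part file, which is then registered for job 3 via the cache. The paths are made up, it assumes job 2's output is tab-separated text, and it uses Job.addCacheFile, which replaces the older DistributedCache calls in recent Hadoop versions:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineAndCache {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Combiner job: default identity mapper/reducer, exactly one
        // reducer, so all pairs end up in a single part-r-00000 file.
        Job combine = Job.getInstance(conf, "combine-pairs");
        combine.setJarByClass(CombineAndCache.class);
        // Treat each input line as a tab-separated (key, value) pair.
        combine.setInputFormatClass(KeyValueTextInputFormat.class);
        combine.setOutputKeyClass(Text.class);
        combine.setOutputValueClass(Text.class);
        combine.setNumReduceTasks(1);
        FileInputFormat.addInputPath(combine, new Path("s3://my-bucket/job2-output"));
        FileOutputFormat.setOutputPath(combine, new Path("s3://my-bucket/combined"));
        if (!combine.waitForCompletion(true)) System.exit(1);

        // Job 3: ship the single combined file to every task.
        Job job3 = Job.getInstance(conf, "job3");
        job3.setJarByClass(CombineAndCache.class);
        job3.addCacheFile(new URI("s3://my-bucket/combined/part-r-00000"));
        // ... set job 3's mapper/reducer and input/output paths, then run ...
    }
}
```

Each of job 3's map tasks could then read the cached file in its setup() method and load it into a HashMap, assuming the combined pairs fit in task memory.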
Please advise. Thanks!