I am running a series of MapReduce jobs on EMR. The 3rd MapReduce job needs the output of the 2nd MapReduce job, and that output is over a million key-value pairs (both the key and the value are under 1KB). Is there a good way to store this data in a distributed store on the same cluster as the EMR jobs so the subsequent job can access it? I looked at DistributedCache, but isn't that meant for distributing files? I am not sure Hadoop handles a million tiny files well.
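For reference, this is roughly how the chaining looks today: job 3 just points its input path at job 2's output directory. This is only a sketch; the paths and job name are made up, and it assumes the newer mapreduce API with job 2 writing Text/Text pairs as SequenceFiles:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class JobChain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical locations; on EMR these would be S3 or HDFS paths.
        Path job2Output = new Path("s3://my-bucket/job2-output");
        Path job3Output = new Path("s3://my-bucket/job3-output");

        Job job3 = Job.getInstance(conf, "job3");
        job3.setJarByClass(JobChain.class);
        // Job 2 wrote SequenceFiles, so job 3 reads the key-value pairs
        // back with their original types intact.
        job3.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job3, job2Output);
        SequenceFileOutputFormat.setOutputPath(job3, job3Output);
        // Assumes job 2 emitted Text keys and Text values.
        job3.setOutputKeyClass(Text.class);
        job3.setOutputValueClass(Text.class);
        System.exit(job3.waitForCompletion(true) ? 0 : 1);
    }
}
```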
Or maybe I can somehow use another MapReduce job to combine all of the key-value pairs into ONE output file, and then put that entire file into DistributedCache.
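Here is a sketch of what I mean: an identity job with a single reducer so everything lands in one part file, which is then registered for job 3 via the cache. The paths are made up, it assumes job 2's output is tab-separated text, and it uses Job.addCacheFile, which replaces the older DistributedCache calls in recent Hadoop versions:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineAndCache {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Combiner job: default identity mapper/reducer, exactly one
        // reducer, so all pairs end up in a single part-r-00000 file.
        Job combine = Job.getInstance(conf, "combine-pairs");
        combine.setJarByClass(CombineAndCache.class);
        // Treat each input line as a tab-separated (key, value) pair.
        combine.setInputFormatClass(KeyValueTextInputFormat.class);
        combine.setOutputKeyClass(Text.class);
        combine.setOutputValueClass(Text.class);
        combine.setNumReduceTasks(1);
        FileInputFormat.addInputPath(combine, new Path("s3://my-bucket/job2-output"));
        FileOutputFormat.setOutputPath(combine, new Path("s3://my-bucket/combined"));
        if (!combine.waitForCompletion(true)) System.exit(1);

        // Job 3: ship the single combined file to every task.
        Job job3 = Job.getInstance(conf, "job3");
        job3.setJarByClass(CombineAndCache.class);
        job3.addCacheFile(new URI("s3://my-bucket/combined/part-r-00000"));
        // ... set job 3's mapper/reducer and input/output paths, then run ...
    }
}
```

Each of job 3's map tasks could then read the cached file in its setup() method and load it into a HashMap, assuming the combined pairs fit in task memory.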
Please advise. Thanks!