
I am planning to use mincemeat.py for a map-reduce task on a ~100 GB file. From the example code that ships with mincemeat, it seems I need to supply an in-memory dictionary as the data source. What is the right way to provide my huge file as the data source for mincemeat?

Link to mincemeat: https://github.com/michaelfairley/mincemeatpy
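
For reference, the stock example that ships with mincemeat builds its data source as a plain in-memory dict, one entry per line of text. Roughly (reproduced from memory of the project README, so the literal strings may differ):

    #!/usr/bin/env python
    # Word-count example in the style of the mincemeat.py README.
    import mincemeat

    data = ["Humpty Dumpty sat on a wall",
            "Humpty Dumpty had a great fall",
            "All the King's horses and all the King's men",
            "Couldn't put Humpty together again"]

    # The entire data source is held in memory: key -> line of text.
    datasource = dict(enumerate(data))

    def mapfn(k, v):
        for w in v.split():
            yield w, 1

    def reducefn(k, vs):
        return sum(vs)

    s = mincemeat.Server()
    s.datasource = datasource
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="changeme")
    print(results)

That approach clearly does not scale to a ~100 GB input, hence the question.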

Karthikeyan
  • Tried iterator instead? – dmitry Jul 29 '13 at 09:16
  • It seems I have to create a complete dictionary beforehand. Do you want me to try an iterator over the file? But I still need to get the file contents into the dict, and that is where I am a bit confused. – Karthikeyan Jul 29 '13 at 09:18
  • 2
    Citing from github page: datasource: ...You may use a dict, or any other data structure which implements the iterator protocol (__iter__() and next()) for returning all keys... Seems it is the only reasonable way to go with huge files, though I'd like to know exact practical solution as well as you :) – dmitry Jul 29 '13 at 09:23
  • Just don't forget to post your solution here when you find one, friend. – dmitry Jul 29 '13 at 10:15
  • @dmitry, Yes, surely I will do that. – Karthikeyan Jul 29 '13 at 10:18

1 Answer


Looking at the example and the concept, I would have thought that you would ideally:

  1. Produce an iterator/dict-like wrapper over the data source (see the sketch after this list),
  2. Split the file into a number of merely large files spread across a number of servers, and then
  3. Merge the results.
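
A minimal sketch of point 1, assuming (as the docs quoted in the comments suggest) that mincemeat only needs the data source to support __iter__() for yielding keys, plus __getitem__() for looking up the value of a key. Here the keys are byte offsets of line starts, so only the offset index, not the file contents, is held in memory. The class name, the file path, and the word-count mapfn/reducefn are placeholders, not anything mincemeat prescribes:

    import mincemeat

    class FileLineDatasource(object):
        """Dict-like view of a big text file.

        Keys are byte offsets of line starts; values are the lines themselves.
        Only the list of offsets is kept in memory, never the whole file.
        """
        def __init__(self, path):
            self.path = path
            self.offsets = []
            offset = 0
            with open(path, "rb") as f:
                for line in f:
                    self.offsets.append(offset)
                    offset += len(line)
            self._reader = open(path, "rb")

        def __iter__(self):
            # mincemeat iterates over the data source to obtain the keys ...
            return iter(self.offsets)

        def __getitem__(self, offset):
            # ... and indexes it with each key to obtain the value to map.
            self._reader.seek(offset)
            return self._reader.readline()

    def mapfn(k, v):
        # Placeholder map function: word count over one line.
        for w in v.split():
            yield w, 1

    def reducefn(k, vs):
        return sum(vs)

    s = mincemeat.Server()
    s.datasource = FileLineDatasource("/path/to/huge_file.txt")  # hypothetical path
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="changeme")

For a ~100 GB file the per-line offset list itself gets large, and one line per map task means a lot of round trips, so in practice you would probably index only every N-th line and have __getitem__ return a whole chunk of lines; the __iter__/__getitem__ shape stays the same. Points 2 and 3 then amount to running one such server per file piece and merging the per-server result dicts afterwards.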
Steve Barnes