
I am planning to use mincemeat.py for a map-reduce task on a ~100 GB file. From the example code that ships with mincemeat, it seems I need to supply an in-memory dictionary as the data source. What is the right way to provide my huge file as the data source for mincemeat?

Link to mincemeat: https://github.com/michaelfairley/mincemeatpy
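
For reference, the stock example that ships with mincemeat builds its data source as a plain in-memory dict, one entry per line of text. Roughly (reproduced from memory of the project README, so the literal strings may differ):

    #!/usr/bin/env python
    # Word-count example in the style of the mincemeat.py README.
    import mincemeat

    data = ["Humpty Dumpty sat on a wall",
            "Humpty Dumpty had a great fall",
            "All the King's horses and all the King's men",
            "Couldn't put Humpty together again"]

    # The entire data source is held in memory: key -> line of text.
    datasource = dict(enumerate(data))

    def mapfn(k, v):
        for w in v.split():
            yield w, 1

    def reducefn(k, vs):
        return sum(vs)

    s = mincemeat.Server()
    s.datasource = datasource
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="changeme")
    print(results)

That approach clearly does not scale to a ~100 GB input, hence the question.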

Karthikeyan
  • Tried iterator instead? – dmitry Jul 29 '13 at 09:16
  • It seems I have to create a complete dictionary beforehand. Do you want me to try an iterator over the file? But I still need to get the file contents into the dict, and that is where I am a bit confused. – Karthikeyan Jul 29 '13 at 09:18
  • 2
    Citing from github page: datasource: ...You may use a dict, or any other data structure which implements the iterator protocol (__iter__() and next()) for returning all keys... Seems it is the only reasonable way to go with huge files, though I'd like to know exact practical solution as well as you :) – dmitry Jul 29 '13 at 09:23
  • Just don't forget to post your solution here when you find one, friend. – dmitry Jul 29 '13 at 10:15
  • @dmitry, Yes, surely I will do that. – Karthikeyan Jul 29 '13 at 10:18

1 Answer


Looking at the example and the concept, I would have thought that you would ideally:

  1. Produce an iterator/dict-like wrapper over the data source (see the sketch after this list),
  2. Split the file into a number of merely large files spread across a number of servers, and then
  3. Merge the results.
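
A minimal sketch of point 1, assuming (as the docs quoted in the comments suggest) that mincemeat only needs the data source to support __iter__() for yielding keys, plus __getitem__() for looking up the value of a key. Here the keys are byte offsets of line starts, so only the offset index, not the file contents, is held in memory. The class name, the file path, and the word-count mapfn/reducefn are placeholders, not anything mincemeat prescribes:

    import mincemeat

    class FileLineDatasource(object):
        """Dict-like view of a big text file.

        Keys are byte offsets of line starts; values are the lines themselves.
        Only the list of offsets is kept in memory, never the whole file.
        """
        def __init__(self, path):
            self.path = path
            self.offsets = []
            offset = 0
            with open(path, "rb") as f:
                for line in f:
                    self.offsets.append(offset)
                    offset += len(line)
            self._reader = open(path, "rb")

        def __iter__(self):
            # mincemeat iterates over the data source to obtain the keys ...
            return iter(self.offsets)

        def __getitem__(self, offset):
            # ... and indexes it with each key to obtain the value to map.
            self._reader.seek(offset)
            return self._reader.readline()

    def mapfn(k, v):
        # Placeholder map function: word count over one line.
        for w in v.split():
            yield w, 1

    def reducefn(k, vs):
        return sum(vs)

    s = mincemeat.Server()
    s.datasource = FileLineDatasource("/path/to/huge_file.txt")  # hypothetical path
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="changeme")

For a ~100 GB file the per-line offset list itself gets large, and one line per map task means a lot of round trips, so in practice you would probably index only every N-th line and have __getitem__ return a whole chunk of lines; the __iter__/__getitem__ shape stays the same. Points 2 and 3 then amount to running one such server per file piece and merging the per-server result dicts afterwards.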
Steve Barnes