I am trying to understand the MapReduce concept by implementing small programs with mincemeat.py, an open-source library for Python.
I already have a simple word count for a bag of words working with a mapper and a reducer. Next, I would like to compute tf-idf scores for all words across documents. As a first step, I want to build a dictionary of the form {[word, docID] -> count}. For this I have written the following mapper:
def mapfn(k, v):
    for line in v.splitlines():
        for word in line.split():
            l = [word.lower(), k]
            yield l, 1
However, when I run the program, I get the following error:
error: uncaptured python exception, closing channel <__main__.Client connected at 0x8a434ac>
(<type 'exceptions.TypeError'>:unhashable type: 'list'
[/usr/lib/python2.7/asyncore.py|read|83]
[/usr/lib/python2.7/asyncore.py|handle_read_event|444]
[/usr/lib/python2.7/asynchat.py|handle_read|140]
[mincemeat.py|found_terminator|96]
[mincemeat.py|process_command|194]
[mincemeat.py|call_mapfn|171])
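The traceback ends in call_mapfn, so the failure seems to happen while the map output is being collected rather than during the reduce step. A minimal sketch that reproduces the same TypeError without mincemeat, assuming the worker groups the yielded values in a dictionary keyed by whatever the mapper yields (that is a guess about the internals; group_map_output is a hypothetical helper, not part of the library):

```python
def group_map_output(pairs):
    """Collect (key, value) pairs into a dict of lists.

    Hypothetical stand-in for what a map worker plausibly does with the
    pairs a mapper yields; not mincemeat's actual code.
    """
    results = {}
    for key, value in pairs:
        # Using the key as a dict key requires it to be hashable.
        results.setdefault(key, []).append(value)
    return results


try:
    # Same shape as the mapper output above: a list used as the key.
    group_map_output([(["word", "doc1"], 1)])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If that assumption holds, the problem is with using a list as a dictionary key in general, not with anything specific to the reducer.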
My understanding is that a list cannot be yielded as the key from the map function in mincemeat.py, since the error reports that a list is unhashable. Is that correct? If so, is there a way to work around this, or do I need to use a library other than mincemeat?
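One workaround I am considering is to yield a tuple instead of a list, since tuples of strings are hashable. The sketch below is only a local simulation under that assumption: the grouping loop and the sample documents are made up for illustration and stand in for what mincemeat would do between the map and reduce phases; I have not confirmed how the library itself handles tuple keys.

```python
from collections import defaultdict


def mapfn(k, v):
    # k is the document ID, v is the document text.
    for line in v.splitlines():
        for word in line.split():
            # A (word, docID) tuple is hashable, unlike [word, docID].
            yield (word.lower(), k), 1


def reducefn(k, vs):
    # Sum the 1s emitted for each (word, docID) key.
    return sum(vs)


# Hypothetical sample input: docID -> text.
docs = {"doc1": "the cat\nThe dog", "doc2": "the end"}

# Local stand-in for the shuffle step: group values by key.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for key, value in mapfn(doc_id, text):
        grouped[key].append(value)

counts = {key: reducefn(key, values) for key, values in grouped.items()}
print(counts[("the", "doc1")])
```

With these sample documents, the key ("the", "doc1") ends up with a count of 2, which is the per-document term count needed for the tf part of tf-idf.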