I need to read a large file and update an imported dictionary accordingly, using multiprocessing Pool and Manager. Here is my code:

import json
from multiprocessing import Pool, Manager

manager = Manager()
d = manager.dict()
with open('~/file.json') as json_file:
    imported_dic = json.load(json_file)  # loading a file containing a large dictionary
d.update(imported_dic)

def f(line):
    data = line.split('\t')
    uid = data[0]
    tweet = data[2].decode('utf-8')

    if #sth in tweet:
        d[uid] += 1

p = Pool(4)
with open('~/test_1k.txt') as source_file:
    p.map(f, source_file)

But it does not work properly. Any idea what I am doing wrong here?

msmazh

1 Answer

Try this code:

d = init_dictionary()  # placeholder: load your dictionary here

def f(line):
    data = line.split('\t')
    uid = data[0]
    tweet = data[2].decode('utf-8')

    if uid in d:
        for n in d[uid].keys():
            if n in tweet:
                yield uid, n, 1
            else:
                yield uid, n, 0


p = Pool(4)

with open('~/test_1k.txt') as source_file:
    for stat in p.map(f, source_file):
        uid, n, r = stat
        d[uid][n] += r

It's the same solution, but without a shared dictionary.
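
One caveat with a generator-based worker: Pool.map has to pickle whatever the worker returns in order to send it back to the parent, and generator objects cannot be pickled. A minimal sketch of the same idea with the worker returning a plain list instead, written for Python 3 (so no .decode call). The names COUNTS, score_line and count_mentions are made up for illustration, and the sample dictionary is the one from the comments below:

```python
from multiprocessing import Pool

# Hypothetical sample dictionary, copied from the example in the comments.
COUNTS = {'100022441': {'@frankgaffney': 0, '@DavidBartonWB': 0}}


def score_line(line):
    """Return a list of (uid, name, hit) tuples for one tab-separated line."""
    data = line.rstrip('\n').split('\t')
    if len(data) < 3:
        return []  # skip malformed lines instead of raising IndexError
    uid, tweet = data[0], data[2]
    names = COUNTS.get(uid, {})
    # A list (unlike a generator object) can be pickled and sent to the parent.
    return [(uid, name, int(name in tweet)) for name in names]


def count_mentions(lines, processes=4):
    """Score lines in worker processes; only the parent updates the totals."""
    totals = {uid: dict(names) for uid, names in COUNTS.items()}
    with Pool(processes) as pool:
        for stats in pool.map(score_line, lines):
            for uid, name, hit in stats:
                totals[uid][name] += hit
    return totals
```

Calling count_mentions(source_file) on an open file should then do the counting. On platforms that start workers with spawn (Windows, recent macOS), that call needs to sit under an `if __name__ == '__main__':` guard. Since only the parent process writes to the totals, no Manager or lock is needed.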

Jimilian
  • Thanks. It works, but not perfectly. I have several keys for each uid in my imported dictionary. For some reason, your code only returns the last key for each uid. For example, if d = {'1':{'a':0, 'b':0}}, it only updates d['1']['b'] and loses d['1']['a']. Any idea why? – msmazh Oct 07 '15 at 04:46
  • Assume that i have a file containing 2 rows: 1)100022441 @DavidBartonWB Guarding Constitution 2)100022441 RT @frankgaffney 2nd Amendment Guy. My dict to update is: d={'100022441':{'@frankgaffney':0, '@DavidBartonWB':0}}. And my code is: `def g(line): data = line.split('\t') uid = data[0] tweet = data[2] if uid in d.keys(): for n in d[uid].keys(): if n in tweet: return uid, n, 1 else: return uid, n, 0 p = Pool(4) with open('~/f.txt') as f: for uid, n, r in p.map(g, f): d[uid][n] += r` – msmazh Oct 07 '15 at 22:15
  • The updated d should be: d = {'100022441':{'@frankgaffney': 1, '@DavidBartonWB': 1}}. But the code gives: {'100022441':{'@frankgaffney': 1, '@DavidBartonWB': 0}}. It fails to update the count for '@DavidBartonWB'. – msmazh Oct 07 '15 at 22:18
  • @msmazh, `return` stops function execution immediately, so you never reach the second iteration. If you want to produce a value and then continue the loop, you should use `yield`. I updated my solution according to your code. – Jimilian Oct 08 '15 at 07:38
  • Many thanks. But it gives the following error: "SyntaxError: 'return' with argument inside generator" – msmazh Oct 08 '15 at 08:32
  • @msmazh, yeah, `return` was superfluous. You can remove it :) – Jimilian Oct 08 '15 at 09:19
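
To illustrate the point discussed in these comments: a `return` inside the loop ends the function on the very first key, whichever branch is taken, so the remaining keys are never checked. The SyntaxError reported above is Python 2's complaint about mixing `yield` with `return <value>` in one function. A small sketch of the difference, with no multiprocessing involved; the function names are hypothetical and the data is the example from the comments:

```python
def first_key_only(names, tweet):
    """Mirrors the commented code: returns during the first iteration."""
    for name in names:
        if name in tweet:
            return (name, 1)
        else:
            return (name, 0)  # also exits the loop, so later names are skipped


def every_key(names, tweet):
    """Checks every name and returns one (name, hit) pair per name."""
    return [(name, int(name in tweet)) for name in names]


names = ['@frankgaffney', '@DavidBartonWB']
tweet = '@DavidBartonWB Guarding Constitution'

print(first_key_only(names, tweet))  # only the first name is ever examined
print(every_key(names, tweet))       # both names are examined
```

Collecting the pairs in a list keeps the "check every key" behaviour that `yield` was meant to give, while still returning an ordinary value.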