
In the following code, whose goal is to do a word count, the `add_counts` function is called concurrently as a thread. Is this operation of reading and updating thread-safe? This answer says that the dictionary update itself may be thread-safe, but what about reading and then updating, as below:

word_counts = {}

@concurrent
def add_counts(line):
    for w in line.split():
        word_counts[w] = word_counts.get(w, 0) + 1

for line in somebigfile:
    add_counts(line)
– stackit

1 Answer


Reading and updating is not thread-safe – here's an example that you can try locally to see the effect in practice:

from threading import Thread


def add_to_counter(ctr):
    for i in range(100000):
        ctr['ctr'] = ctr.get('ctr', 0) + 1


ctr = {}

t1 = Thread(target=add_to_counter, args=(ctr,))
t2 = Thread(target=add_to_counter, args=(ctr,))

t1.start()
t2.start()
t1.join()
t2.join()

print(ctr['ctr'])

The results obviously depend on the scheduling and other system/timing-dependent details, but on my system I consistently get totals below the expected 200000: increments are lost whenever one thread reads the old value before another thread's write lands.

Solution 1: Locks

You could require the threads to acquire a lock every time before they read and modify the dictionary. This serializes the updates, so it slows down program execution somewhat.
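Here's the counter example from above modified to use a lock (a sketch; the name `lock` and the `with` placement are my own choices – any arrangement where every read-modify-write happens while holding the same lock works):

```python
from threading import Lock, Thread


def add_to_counter(ctr, lock):
    for i in range(100000):
        # Hold the lock for the whole read-modify-write,
        # so no other thread can interleave between the
        # .get() and the assignment.
        with lock:
            ctr['ctr'] = ctr.get('ctr', 0) + 1


ctr = {}
lock = Lock()

t1 = Thread(target=add_to_counter, args=(ctr, lock))
t2 = Thread(target=add_to_counter, args=(ctr, lock))

t1.start()
t2.start()
t1.join()
t2.join()

print(ctr['ctr'])  # reliably 200000 now
```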

Solution 2: Sum the counters at the end

Depending on your exact use case, you might be able to assign a separate counter to each thread, and sum the counts together after the threads have finished counting. The dictionary-like collections.Counter allows you to easily add two counters together (here's the above example modified to use Counters):

from collections import Counter
from threading import Thread


def add_to_counter(counter):
    for i in range(100000):
        counter['ctr'] = counter.get('ctr', 0) + 1


ctr1 = Counter()
ctr2 = Counter()

t1 = Thread(target=add_to_counter, args=(ctr1,))
t2 = Thread(target=add_to_counter, args=(ctr2,))

t1.start()
t2.start()
t1.join()
t2.join()

ctr = ctr1 + ctr2

print(ctr['ctr'])
– vield
  • In solution 1, how can I make the locks more granular rather than locking the entire dict? – stackit Apr 12 '17 at 13:03
  • In the second solution: I am using Python futures, so I have no control over the threads and thus no way to pass each one its own counter – stackit Apr 12 '17 at 13:07
  • @stackit I'm not sure if there is a good way to guard against access to one dictionary key with a lock – if there is, hopefully someone else can point you to it. Possibly you could combine the two suggested solutions by maintaining thread-local `Counter` objects, and at suitable intervals, locking the shared dictionary to add the newest counts to it from the current thread? – vield Apr 12 '17 at 13:13
  • Yes, that's what I was just thinking – stackit Apr 12 '17 at 13:18
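The combination suggested in the comments – a thread-local `Counter` that is periodically merged into the shared dictionary under a lock – might look roughly like this (a sketch; the names `shared_counts`, `shared_lock`, and the `flush_every` parameter are assumptions, not from the answer above):

```python
from collections import Counter
from threading import Lock, Thread

shared_counts = Counter()
shared_lock = Lock()


def add_to_counter(n_increments, flush_every=10000):
    local = Counter()  # thread-local, so no lock needed for most increments
    for i in range(n_increments):
        local['ctr'] += 1
        if (i + 1) % flush_every == 0:
            # Only lock occasionally, to merge the batch into the shared Counter.
            with shared_lock:
                shared_counts.update(local)  # Counter.update adds counts
            local.clear()
    # Flush whatever is left over at the end.
    with shared_lock:
        shared_counts.update(local)


t1 = Thread(target=add_to_counter, args=(100000,))
t2 = Thread(target=add_to_counter, args=(100000,))

t1.start()
t2.start()
t1.join()
t2.join()

print(shared_counts['ctr'])  # 200000
```

This keeps the lock contention low – each thread takes the lock once per `flush_every` increments instead of once per increment – while still producing an exact total.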