
I was writing a code example that demonstrates a race condition with numba.jit and parallel=True.

import numpy as np
import pandas as pd
from numba import jit, prange
from collections import Counter
import fractions

n = 10**6  # iterations per call
m = 10**6  # number of repeated runs

@jit(nopython=True, parallel=True)
def test():
    lst = [0]
    for i in prange(n):
        lst[0] += 1  # unsynchronized read-modify-write across threads
    return lst

# count how often each final value shows up, expressed as a fraction of n
error = Counter([str(fractions.Fraction(test()[0], n)) for _ in range(m)])

# convert the fraction strings to floats, sort, and plot the distribution
df = pd.DataFrame(error.items())
def func(x, y='1'): return int(x) / int(y)
df[2] = df[0].apply(lambda _str: func(*_str.split('/')))
df = df.sort_values(2)
ax = df.plot.bar(x=0, y=1)
ax.set_xlabel('ratio count/maximal_count')
ax.get_legend().remove()

What surprised me is that the miscounts caused by the race condition are multiples of n divided by the number of CPU cores. The distribution looks like this:

[bar plot of how often each final count, as a fraction of n, occurred]

I basically understand what's going on:

lst[0] += 1

is short for

buffer = lst[0]
lst[0] = buffer+1 

If another thread does the same thing at the same time, the two writes can interleave and one of the increments is lost.
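To make the interleaving concrete, here is a minimal pure-Python sketch of the same lost-update pattern using plain threads (no Numba); whether and how many increments are lost depends on how the threads happen to be scheduled:

import threading

lst = [0]

def worker(iterations):
    for _ in range(iterations):
        buffer = lst[0]        # read
        lst[0] = buffer + 1    # write; the update is lost if another thread wrote in between

threads = [threading.Thread(target=worker, args=(10**5,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(lst[0])  # often less than 4 * 10**5 because some increments get overwritten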

I have two questions though:

  • Can somebody confirm that it's roughly distributed like this?
  • And why is it distributed like this?
Lukas S
  • so basically `sum(1 for _ in range(n)) == n` is your test? – Joran Beasley Aug 09 '21 at 17:25
  • @JoranBeasley I don't quite get your question. I give an example of a race condition. Run it 10**5 times and am surprised that it's going wrong .../8 of the time. – Lukas S Aug 09 '21 at 17:34
  • I get completely different results on my machine (4 runs): https://i.stack.imgur.com/rNjfK.png. This is very dependent of the platform. – Jérôme Richard Aug 09 '21 at 19:50
  • @JérômeRichard What do you mean 4 runs? It does have the very discrete look too though. What exact values are you getting? Is it also a multiple of n/cpu cores? I added the code for my plot. Would you mind doing the exact same thing? – Lukas S Aug 09 '21 at 20:45
  • I ran your code 4 times. For each set of results I stacked new points onto a scatter plot; the x axis is the fractional part (converted to a float) and the y axis is the count. Note that I get multiple very close fractions on my 6-core machine. [Here](https://i.stack.imgur.com/X4PAW.png) are the new results based on your new code. – Jérôme Richard Aug 09 '21 at 21:18
  • My results on an 8-processor system have different distribution, and it varies from run to run. My peak is at 1/4, 3/8, 1/2. I'm very surprised by this. In regular Python, the list += operator is specially handled to be atomic. – Tim Roberts Aug 09 '21 at 21:20
  • @TimRoberts Can you tell more about that atomic handling? As far as I can tell, it's [quite a few separate steps](https://tio.run/##K6gsycjPM7YoKPr/PzO3IL@oRCEls5gLiPWAWEM9p7gk2iBWQdtWwVBd8/9/AA). – Kelly Bundy Aug 09 '21 at 21:31
  • @KellyBundy @TimRoberts apparently `+=` is not thread safe but `lst.append(...)` is. https://stackoverflow.com/questions/6319207/are-lists-thread-safe – Lukas S Aug 09 '21 at 21:48
  • Hmmm. I have always understood that `lst[0] += 1` is cheaper than `lst[0] = lst[0] + 1` because there was a magic inplace add handler. `dis.dis` shows it's exactly the same number of IL operations. Next you'll tell me there's no Santa. – Tim Roberts Aug 09 '21 at 21:57
  • Looks like every core gets its own list and then accumulates on it. One of them is returned. But why don't you use a simple scalar (this case is handled by Numba; see the sketch after these comments)? BTW: Using global variables is also not a good idea. They are hard-coded into the compiled function and won't be updated without (manual) recompilation. – max9111 Aug 10 '21 at 08:23
  • @max9111 sorry, but did you read my post? It was supposed to be an example of a race condition. Interesting note on the global `n`. So you think it's dividing the work into one chunk per CPU core and keeping a buffer in each as an optimization? That seems plausible, but do you have a source or experiment that would support that? – Lukas S Aug 10 '21 at 11:08
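Following up on max9111's comment, here is a minimal sketch (my own, with a hypothetical name count_scalar and with n passed as an argument instead of a global) of the scalar variant he refers to: Numba's parallel backend recognizes an in-place update of a scalar inside a prange loop as a reduction, so no increments are lost.

from numba import njit, prange

@njit(parallel=True)
def count_scalar(n):
    total = 0
    for i in prange(n):
        total += 1  # scalar accumulation, compiled as a parallel reduction
    return total

print(count_scalar(10**6))  # reliably prints 1000000, unlike the list-based version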
