
I have a NumPy array of length 500 and a Pandas DataFrame with two integer columns, A and B. For every row of the DataFrame, I want to increment the slice y[A:B] of the NumPy array by a fraction inversely proportional to the size of the slice, i.e. by 1 / (B - A).
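
To make the goal concrete, here is a toy version of the operation with a length-10 array and two made-up rows (the real array has length 500):

```python
import numpy as np
import pandas as pd

# Length-10 stand-in for the length-500 array
y = np.zeros(10)
df = pd.DataFrame({'A': [2, 2], 'B': [4, 7]})

for _, l, r in df.itertuples():
    y[l:r] += 1 / (r - l)  # each slice is incremented by 1/len(slice)

# y[2:4] received 1/2 from the first row plus 1/5 from the second;
# y[4:7] received only 1/5
```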

The best I could come up with is:

import pandas as pd
import numpy as np

def do(n_rows):
    df = pd.DataFrame()
    df['A'] = np.random.choice(range(0, 400), n_rows)
    df['B'] = df['A'] + np.random.choice(range(1, 50), n_rows)
    y = np.zeros(500)
    for row in df[['A', 'B']].itertuples():
        _, l, r = row
        y[l:r] += 1 / (r - l)
    return y

if __name__=='__main__':
    from timeit import Timer
    for n_rows in [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
        t = Timer("do(%d)" % n_rows, "from __main__ import do")
        print("Time with %d rows:" % n_rows, t.timeit(number=1))

which, on my machine, outputs:

Time with 1 rows: 0.0033148399088531733
Time with 10 rows: 0.0027232079301029444
Time with 100 rows: 0.0028421489987522364
Time with 1000 rows: 0.0047673441004008055
Time with 10000 rows: 0.023656874895095825
Time with 100000 rows: 0.21090111997909844
Time with 1000000 rows: 1.9661296689882874
Time with 10000000 rows: 19.77073140698485
Time with 100000000 rows: 198.70571927190758

I need to run this code on ~70 million rows, about 150 times, every time I re-analyse my data. Can anyone think of some way to significantly improve performance here?

Timings + final code (I left Pandas out of it and cleaned it up a bit)

import numpy as np

def do(n_rows, solver):
    s0 = np.random.choice(range(0, 400), n_rows)
    s1 = s0 + np.random.choice(range(1, 50), n_rows)
    return solver(s0, s1, 500)

def sol1(s0, s1, points):
    y = np.zeros(points)
    for l, r in zip(s0, s1):
        y[l:r] += 1 / (r - l)
    return y

def sol2(s0, s1, points):
    out = np.zeros(points)
    v = 1.0 / (s1 - s0)
    np.add.at(out, s0, +v)
    np.add.at(out, s1, -v)
    np.cumsum(out, out=out)
    out[np.isclose(out, 0)] = 0
    return out

def sol3(s0, s1, points):
    v = 1.0 / (s1 - s0)
    out = np.bincount(s0, v, minlength=points) - np.bincount(s1, v, minlength=points)
    np.cumsum(out, out=out)
    out[np.isclose(out, 0)] = 0
    return out

if __name__=='__main__':

    # Test for correctness
    s0 = np.array([2,2,4])
    s1 = np.array([4,5,6])
    r1 = sol1(s0, s1, 10)
    r2 = sol2(s0, s1, 10)
    r3 = sol3(s0, s1, 10)
    print(r1.tolist())
    print(r2.tolist())
    print(r3.tolist())
    print(np.allclose(r1, r2))
    print(np.allclose(r1, r3))

    # Test for speed
    from timeit import Timer
    for n_rows in [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
        print("-------------------------------------------------------------")
        t = Timer("do(%d, sol1)" % n_rows, "from __main__ import do, sol1")
        print("Old App : Time with %d rows:" % n_rows, t.timeit(number=1))
        t = Timer("do(%d, sol2)" % n_rows, "from __main__ import do, sol2")
        print("New App : Time with %d rows:" % n_rows, t.timeit(number=1))
        t = Timer("do(%d, sol3)" % n_rows, "from __main__ import do, sol3")
        print("Newer App : Time with %d rows:" % n_rows, t.timeit(number=1))

Results

[0.0, 0.0, 0.8333333333333333, 0.8333333333333333, 0.8333333333333333, 0.5, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.8333333333333333, 0.8333333333333333, 0.8333333333333333, 0.49999999999999994, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.8333333333333333, 0.8333333333333333, 0.8333333333333333, 0.49999999999999994, 0.0, 0.0, 0.0, 0.0]
True
True
-------------------------------------------------------------
Old App : Time with 1 rows: 0.00015379209071397781
New App : Time with 1 rows: 0.00018249684944748878
Newer App : Time with 1 rows: 0.00016245199367403984
-------------------------------------------------------------
Old App : Time with 10 rows: 0.0001194290816783905
New App : Time with 10 rows: 0.00017586094327270985
Newer App : Time with 10 rows: 0.0001628710888326168
-------------------------------------------------------------
Old App : Time with 100 rows: 0.0002433289773762226
New App : Time with 100 rows: 0.00019087712280452251
Newer App : Time with 100 rows: 0.00016685202717781067
-------------------------------------------------------------
Old App : Time with 1000 rows: 0.001470518996939063
New App : Time with 1000 rows: 0.00034397002309560776
Newer App : Time with 1000 rows: 0.0001920647919178009
-------------------------------------------------------------
Old App : Time with 10000 rows: 0.016746808076277375
New App : Time with 10000 rows: 0.0035001228097826242
Newer App : Time with 10000 rows: 0.0006547670345753431
-------------------------------------------------------------
Old App : Time with 100000 rows: 0.15572935109958053
New App : Time with 100000 rows: 0.017682339064776897
Newer App : Time with 100000 rows: 0.0035442619118839502
-------------------------------------------------------------
Old App : Time with 1000000 rows: 1.3663988099433482
New App : Time with 1000000 rows: 0.1856136831920594
Newer App : Time with 1000000 rows: 0.044601815985515714
-------------------------------------------------------------
Old App : Time with 10000000 rows: 14.203968763817102
New App : Time with 10000000 rows: 2.116381539963186
Newer App : Time with 10000000 rows: 0.6719322800636292
-------------------------------------------------------------
Old App : Time with 100000000 rows: 142.28248666599393
New App : Time with 100000000 rows: 21.45379024511203
Newer App : Time with 100000000 rows: 5.622305189957842

25x speedup, not bad ;)
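
For context, sol2 and sol3 both rely on the classic difference-array (prefix-sum) trick: to add v over every index in the half-open range [l, r), record +v at index l and -v at index r, then take a running sum. A minimal standalone illustration:

```python
import numpy as np

out = np.zeros(10)
l, r, v = 2, 6, 0.25

out[l] += v              # value starts contributing at l
out[r] -= v              # value stops contributing at r
np.cumsum(out, out=out)  # running sum turns the two spikes into a plateau

# out is now 0.25 at indices 2..5 and 0.0 elsewhere
```

This is what lets the vectorized versions do O(rows + points) work instead of O(total slice length).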


1 Answer

We could vectorize that loop with np.add.at/np.bincount, using a trick similar to the one mentioned in this post, like so -

a = df.values
L = 500
s0, s1 = a[:, 0], a[:, 1]
v = 1.0 / (s1 - s0)
# @Daniel F's suggestion :
out = np.bincount(s0, v, minlength=L) - np.bincount(s1, v, minlength=L)
np.cumsum(out, out=out)
out[np.isclose(out, 0)] = 0
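
As a sanity check (my own illustration, not part of the original answer), np.bincount with weights performs the same scatter-add as np.add.at, correctly accumulating repeated indices:

```python
import numpy as np

s0 = np.array([2, 2, 4])
v = np.array([0.5, 0.2, 0.5])
L = 10

out_at = np.zeros(L)
np.add.at(out_at, s0, v)  # unbuffered scatter-add: index 2 gets 0.5 + 0.2

out_bc = np.bincount(s0, weights=v, minlength=L)  # same accumulation

assert np.allclose(out_at, out_bc)
```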

Benchmarking

Using the same setup as used in the question :

def do(n_rows):
    df = pd.DataFrame()
    df['A'] = np.random.choice(range(0, 400), n_rows)
    df['B'] = df['A'] + np.random.choice(range(1, 50), n_rows)
    y = np.zeros(500)
    for row in df[['A', 'B']].itertuples():
        _, l, r = row
        y[l:r] += 1.0 / (r - l)
    return y

def doNumPy(n_rows):
    df = pd.DataFrame()
    df['A'] = np.random.choice(range(0, 400), n_rows)
    df['B'] = df['A'] + np.random.choice(range(1, 50), n_rows)
    a = df.values

    L = 500
    s0, s1 = a[:, 0], a[:, 1]
    v = 1.0 / (s1 - s0)
    out = np.bincount(s0, v, minlength=L) - np.bincount(s1, v, minlength=L)
    np.cumsum(out, out=out)
    out[np.isclose(out, 0)] = 0
    return out

if __name__=='__main__':
    from timeit import Timer
    for n_rows in [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]:
        print("-------------------------------------------------------------")
        t = Timer("do(%d)" % n_rows, "from __main__ import do")
        print("Old App : Time with %d rows:" % n_rows, t.timeit(number=1))
        t = Timer("doNumPy(%d)" % n_rows, "from __main__ import doNumPy")
        print("New App : Time with %d rows:" % n_rows, t.timeit(number=1))

Timings -

-------------------------------------------------------------
('Old App : Time with 1 rows:', 0.0034999847412109375)
('New App : Time with 1 rows:', 0.0022029876708984375)
-------------------------------------------------------------
('Old App : Time with 10 rows:', 0.003280162811279297)
('New App : Time with 10 rows:', 0.00197601318359375)
-------------------------------------------------------------
('Old App : Time with 100 rows:', 0.0030059814453125)
('New App : Time with 100 rows:', 0.0017080307006835938)
-------------------------------------------------------------
('Old App : Time with 1000 rows:', 0.005307912826538086)
('New App : Time with 1000 rows:', 0.0018489360809326172)
-------------------------------------------------------------
('Old App : Time with 10000 rows:', 0.027753829956054688)
('New App : Time with 10000 rows:', 0.0022859573364257812)
-------------------------------------------------------------
('Old App : Time with 100000 rows:', 0.26231813430786133)
('New App : Time with 100000 rows:', 0.008862018585205078)
-------------------------------------------------------------
('Old App : Time with 1000000 rows:', 2.270418882369995)
('New App : Time with 1000000 rows:', 0.06492495536804199)
-------------------------------------------------------------
('Old App : Time with 10000000 rows:', 23.051368951797485)
('New App : Time with 10000000 rows:', 0.6994130611419678)
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Same as my previous comment, you could do `out = np.bincount(a[:, 0], wieghts = v, minlength = 500) - np.bincount(a[:, 1], wieghts = v)` instead of the `add.at` and `cumsum` steps. Not sure if that's faster. – Daniel F Jul 18 '17 at 07:27
  • Aand I can't spell `weights` – Daniel F Jul 18 '17 at 07:33
  • @DanielF For some reason I keep missing out on it. Added. Thanks again! – Divakar Jul 18 '17 at 07:52
  • Sorry guys, it seems both of your methods return different values than mine. (yes, I factored out the DataFrame initialization!) – marcotama Jul 18 '17 at 08:57
  • @tamzord Can you post code to reproduce the difference in values? How did you test out for differences? How different are they? – Divakar Jul 18 '17 at 08:59
  • Sorry, my bad. The results are very very close but not equal (0.49999999999999994 vs 0.5). A closer look showed it. I don't care about differences of that order of magnitude. Thanks! – marcotama Jul 18 '17 at 09:13
  • 1
    By the way, the `bincount` method is more than twice as fast compared to the `add` method :) overall 20x speedup is not bad ;) – marcotama Jul 18 '17 at 09:14
  • Added timings. Should I remove the initial code and leave the cleaned one only? – marcotama Jul 18 '17 at 09:31
  • @tamzord I would say keep the old one. But, just a little gripe - When dealing with floating pt numbers, I don't think its fair to use `==` to compare results, because by their nature, it's hard to have them equal to all digits :) – Divakar Jul 18 '17 at 09:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/149460/discussion-between-tamzord-and-divakar). – marcotama Jul 18 '17 at 09:55