Using pandas and numpy, what is the most efficient way to do what the f1 function does?

import numpy as np
import pandas as pd
from time import time

n = 10000
df = pd.DataFrame()
df["a"] = np.random.randn(n)
df["b"] = np.random.uniform(size=n)


def f1(df):
    df.loc[0, "c"] = 100
    for i in range(1, len(df)):
        df.loc[i, "c"] = df.loc[i, "a"] * df.loc[i, "b"] +\
            (1 - df.loc[i, "a"]) * df.loc[i - 1, "c"]

start_time = time()
f1(df)
elapsed_time = time() - start_time
print(elapsed_time)
vwrobel
  • What do you want `f1` to do? Do you really want `i` to be the index of `df`? – Shihe Zhang Nov 09 '17 at 07:34
  • Instead of using `for` with `range`, `iteritems()` is a good option. – Shihe Zhang Nov 09 '17 at 07:36
  • Hello Shihe, no I do not really need `i` to be the index but how would you write f1 with `iteritems`? I had tried with `iterrows` but it had not been an improvement. – vwrobel Nov 09 '17 at 07:47
  • Maybe try cython? – ags29 Nov 09 '17 at 08:01
  • Yes ags29, I will turn to Cython if there is no efficient solution with numpy/pandas :) – vwrobel Nov 09 '17 at 08:07
  • `iterrows` converts each row to a Series, so it should be slow. Have you tried `itertuples` instead? Look at piRSquared's answer here: [Iterate Rows](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas). Also, you may want to try numba before Cython to bring in the power of JIT. – skrubber Nov 09 '17 at 08:37

1 Answer

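Sometimes scipy.signal can solve such recurrences (for example with lfilter, when the coefficients are constant), but here the coefficient 1 - a[i] changes at every step and I did not find a good solution. A Numba workaround: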

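import numba
import numpy as np

@numba.njit
def f1n(a, b):
    # Same recurrence as f1, run on plain NumPy arrays and compiled by Numba.
    c = np.empty_like(a)
    c[0] = 100
    for i in range(1, len(a)):
        c[i] = a[i] * b[i] + (1 - a[i]) * c[i - 1]
    return c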

Tests:

In [559]: %timeit f1n(df.a.values,df.b.values)
52.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [560]: %timeit f1(df)
4.62 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [563]: np.allclose(df.c,f1n(df.a.values,df.b.values))
Out[563]: True

Roughly 90,000 times faster (4.62 s vs. 52.9 µs per call), and equally readable.
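To get the result back into the DataFrame the way `f1` does, the returned array can be assigned as a column in one step. The sketch below is only an illustration: `f1_arrays` is a hypothetical name, it uses a plain Python loop so that it runs without Numba installed, and the sample data (uniform `a` and `b`, seeded generator) is an assumption rather than the question's data. Decorating the same function with `@numba.njit`, as above, is what provides the speed.

```python
import numpy as np
import pandas as pd


def f1_arrays(a, b, c0=100.0):
    """Same recurrence as f1, on plain NumPy arrays (hypothetical helper)."""
    c = np.empty(len(a), dtype=np.float64)
    c[0] = c0
    for i in range(1, len(a)):
        c[i] = a[i] * b[i] + (1 - a[i]) * c[i - 1]
    return c


# Assumed sample data, not the data from the question.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"a": rng.uniform(size=n), "b": rng.uniform(size=n)})

# Extract each column as an array once, then assign the whole result column,
# instead of reading and writing single cells with df.loc inside the loop.
df["c"] = f1_arrays(df["a"].to_numpy(), df["b"].to_numpy())
```

Even without Numba, working on the arrays avoids the per-cell `df.loc` lookups, which are where most of the time in `f1` goes.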

B. M.