I have a Pandas dataframe with an arbitrary number of columns, and I'd like to apply a function to every column. From the discussion on this SO post, it's better to use np.vectorize than a pandas apply function.

However, how would I use np.vectorize to perform operations over every column?

The best idea I can come up with is np.vectorize in a for loop, but on a dummy dataframe that takes about 2x the time on my machine. Is apply with raw=True already optimal, and is numba then the only way to go even faster? (A sketch of the numba route follows the timing code below.)

import time
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})

def myfunc(a, b):
    return a+b

start_time = time.time()
test_df.apply(lambda x: x + 3, raw=True)
print("--- %s seconds ---" % (time.time() - start_time))

start_time = time.time()
for i in range(test_df.shape[1]):
    np.vectorize(myfunc)(test_df.iloc[:,i], 3)
print("--- %s seconds ---" % (time.time() - start_time))
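For reference, a minimal sketch of what I understand the numba route to look like (myfunc_nb is just an illustrative name, and this assumes numba is installed; numba.vectorize compiles the scalar function into a real ufunc):

import numba

@numba.vectorize(["int64(int64, int64)"])
def myfunc_nb(a, b):
    return a + b

# a compiled ufunc works on the whole 2-D array at once
myfunc_nb(test_df.to_numpy(), 3)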
matsuo_basho
    What is the typical number of rows in the dataframe? With just 5 rows, performance comparisons may be difficult to interpret. – hilberts_drinking_problem Jun 06 '21 at 12:02
  • Read the answers in your link. `np.vectorize` is not fast. If it beats apply, it's because apply is incredibly slow. – hpaulj Jun 06 '21 at 12:11
  • test `myfunc(test_df.iloc[:,i],3)` – hpaulj Jun 06 '21 at 12:23
  • In the link, np.vectorize is the second fastest option after using `list` with `map`. I'm deploying the intended function on grouped dataframes with different group hierarchies, so the number of rows in each group will vary between 4 and around 20K – matsuo_basho Jun 06 '21 at 12:43
  • On a 10K dummy dataframe, `myfunc(test_df.iloc[:,i],3)` takes ~1/3rd longer than apply – matsuo_basho Jun 06 '21 at 12:53

1 Answer

The correct way to use np.vectorize is not to use it, unless you are dealing with a function that only accepts scalar values and you don't care about speed. Whenever I've tested it, explicit Python iteration has been faster.

At least that's the case when working with numpy arrays. With DataFrames, things become more complicated, since extracting Series and recreating frames can skew the timings substantially.


But let's look at your example in some detail.

Your sample frame:

In [177]: test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
     ...: 
In [178]: test_df
Out[178]: 
   a  b  c
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
In [179]: def myfunc(a, b):
     ...:     return a+b
     ...: 

Your apply:

In [180]: test_df.apply(lambda x: x+3, raw=True)
Out[180]: 
   a  b  c
0  3  3  3
1  4  4  4
2  5  5  5
3  6  6  6
4  7  7  7
In [181]: timeit test_df.apply(lambda x: x+3, raw=True)
186 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 1.23 ms ± 13.9 µs per loop without raw=True

I get the same result by simply using the frame's own addition operator, and it is faster. Granted, that won't work for a more general function. Your use of apply with the default axis and raw=True implies you have a function that only works on one column at a time.

In [182]: test_df+3
Out[182]: 
   a  b  c
0  3  3  3
1  4  4  4
2  5  5  5
3  6  6  6
4  7  7  7
In [183]: timeit test_df+3
114 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

With raw=True you are passing numpy arrays to the lambda. The array for the whole frame is:

In [184]: test_df.to_numpy()
Out[184]: 
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])
In [185]: test_df.to_numpy()+3
Out[185]: 
array([[3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6],
       [7, 7, 7]])
In [186]: timeit test_df.to_numpy()+3
13.1 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That's much faster. But rebuilding a frame from the result takes time.

In [188]: timeit pd.DataFrame(test_df.to_numpy()+3, columns=test_df.columns)
91.1 µs ± 769 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
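As an aside, if the original index matters, the same rebuild can carry it over:

pd.DataFrame(test_df.to_numpy() + 3, index=test_df.index, columns=test_df.columns)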

Testing np.vectorize:

In [218]: f=np.vectorize(myfunc)

f applies myfunc to each element of the input array, iterating in Python. The np.vectorize docs carry a clear performance disclaimer (quoted at the end of this answer).
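Roughly, a simplified sketch of what that means (the real implementation also handles broadcasting, output types, and multiple outputs):

def vectorize_sketch(func, arr, *args):
    # Python-level loop over every element; this is the hidden cost
    return np.array([func(x, *args) for x in arr.ravel()]).reshape(arr.shape)

vectorize_sketch(myfunc, test_df.to_numpy(), 3)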

Even for this small array it is slow compared to direct application of the function to the array:

In [219]: timeit f(test_df.to_numpy(),3)
42.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Passing the frame itself:

In [221]: timeit f(test_df,3)
69.8 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [223]: timeit pd.DataFrame(f(test_df,3), columns=test_df.columns)
154 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And applying it to the columns one at a time is much slower:

In [226]: [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
Out[226]: [array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7])]
In [227]: timeit [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
477 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But a lot of that extra time comes from "extracting" the columns as pandas Series; indexing the numpy array directly is much cheaper:

In [228]: timeit [f(test_df.to_numpy()[:,i], 3) for i in range(test_df.shape[1])]
127 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
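Putting the pieces together: if the real function genuinely works one column at a time, a sketch of the pattern these timings point to is to extract the array once, loop over its columns, and rebuild the frame at the end (this assumes myfunc accepts whole 1-D arrays; if it is truly scalar-only, that is where numba or a plain inner loop would come in):

arr = test_df.to_numpy()                      # extract once, not per column
out = np.column_stack([myfunc(arr[:, i], 3)   # loop over columns of the array
                       for i in range(arr.shape[1])])
result = pd.DataFrame(out, index=test_df.index, columns=test_df.columns)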

Quoting the np.vectorize docs:

Notes
-----
The `vectorize` function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
hpaulj