
If I have a function

def do_irreversible_thing(a, b):
    print(a, b)

And a dataframe, say

df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])

What's the best way to run the function exactly once for each row in a pandas dataframe? As pointed out in other questions, something like df.apply will call the function twice for the first row. Even using numpy

np.vectorize(do_irreversible_thing)(df.a, df.b)

causes the function to be called twice on the first row, as will df.T.apply() or df.apply(..., axis=1).
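For the np.vectorize case specifically, the extra call comes from numpy probing the first element to work out the output type. A minimal sketch, assuming the function's return value does not matter: passing otypes up front removes the need for that probe. The calls list and the name run_once_per_row are illustrative additions, not part of the original question.

```python
import numpy as np
import pandas as pd

calls = []  # hypothetical recorder standing in for the irreversible side effect


def do_irreversible_thing(a, b):
    calls.append((int(a), int(b)))


df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=["a", "b"])

# Declaring otypes tells np.vectorize the output dtype in advance,
# so it does not need a trial call on the first element.
run_once_per_row = np.vectorize(do_irreversible_thing, otypes=[object])
run_once_per_row(df["a"].to_numpy(), df["b"].to_numpy())

print(len(calls))  # number of times the function actually ran
```

Note that np.vectorize is still a Python-level loop internally, so this is a convenience rather than a speed-up.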

Is there a faster or cleaner way to call the function with every row than this explicit loop?

for idx, a, b in df.itertuples():
    do_irreversible_thing(a, b)
David Nehme
  • This sounds like a job for a `for` loop. There generally isn't a good way to vectorize side effects. – user2357112 Apr 13 '16 at 21:04
  • If the side effects don't depend on the operation for each row then it should be vectorizable – EdChum Apr 13 '16 at 21:06
  • If you need to run an explicit loop, you may get better performance with `zip(df.a, df.b)` or `df.itertuples()`, as detailed in [this answer](http://stackoverflow.com/a/34311080/3339965). – root Apr 13 '16 at 22:11
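The two explicit loops suggested in the comments can be sketched as follows. This is only an illustration under the question's example data; the calls list is a hypothetical stand-in for the real side effect so the number of invocations can be inspected.

```python
import pandas as pd

calls = []  # hypothetical recorder standing in for the irreversible side effect


def do_irreversible_thing(a, b):
    calls.append((a, b))


df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=["a", "b"])

# Option 1: zip the columns of interest and unpack each pair.
for a, b in zip(df["a"], df["b"]):
    do_irreversible_thing(a, b)

# Option 2: itertuples with index=False yields one namedtuple per row,
# with fields named after the columns.
for row in df.itertuples(index=False):
    do_irreversible_thing(row.a, row.b)

print(len(calls))  # total invocations across both loops
```

Either loop invokes the function once per row, with no probing call on the first row.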

2 Answers


The way I do it (because I also don't like the idea of looping with df.itertuples) is:

df.apply(do_irreversible_thing, axis=1)

and then your function should be like:

def do_irreversible_thing(x):
    print(x.a, x.b)

This way, the function runs over each row.

Alternatively, if you can't modify your function, you can apply it like this:

df.apply(lambda x: do_irreversible_thing(x['a'], x['b']), axis=1)
Rosa Alejandra

It's unclear what your function does, but to apply a function to each row you can pass axis=1 to apply, which runs it row-wise, and pass in the column elements of interest:

In [155]:
def foo(a,b):
    return a*b
df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])
df.apply(lambda x: foo(x['a'], x['b']), axis=1)

Out[155]:
0     0
1     6
2    20
dtype: int64

However, so long as your function does not depend on the df mutating on each row, then you can just use a vectorised method to operate on the entire column:

In [156]:
df['a'] * df['b']

Out[156]:
0     0
1     6
2    20
dtype: int64

The vectorised operation scales better because it works on whole columns at once, whereas apply is essentially syntactic sugar for iterating over your df, so it is a for loop under the hood.
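As a rough check of that equivalence, the sketch below computes the same result three ways using the foo function from above. It assumes foo is a pure function with no side effects; the variable names row_wise, vectorised and explicit_loop are illustrative.

```python
import pandas as pd


def foo(a, b):
    return a * b


df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=["a", "b"])

# Row-wise apply: calls foo once per row at Python level.
row_wise = df.apply(lambda x: foo(x["a"], x["b"]), axis=1)

# Vectorised: a single operation over whole columns.
vectorised = df["a"] * df["b"]

# Plain loop: what apply amounts to conceptually.
explicit_loop = pd.Series(
    [foo(a, b) for a, b in zip(df["a"], df["b"])], index=df.index
)

print(row_wise.tolist() == vectorised.tolist() == explicit_loop.tolist())
```

This only holds for functions whose result depends on the inputs alone; a function with irreversible side effects still needs one call per row.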

EdChum