271

I want to create a new column in a pandas data frame by applying a function to two existing columns. Following this answer I've been able to create a new column when I only need one column as an argument:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})

def fx(x):
    return x * x

print(df)
df['newcolumn'] = df.A.apply(fx)
print(df)

However, I cannot figure out how to do the same thing when the function requires multiple arguments. For example, how do I create a new column by passing column A and column B to the function below?

def fxy(x, y):
    return x * y
Michael

7 Answers

395

You can go with @greenAfrican's example if you are able to rewrite your function. If you don't want to rewrite it, you can wrap it in an anonymous function inside apply, like this:

>>> def fxy(x, y):
...     return x * y

>>> df['newcolumn'] = df.apply(lambda x: fxy(x['A'], x['B']), axis=1)
>>> df
    A   B  newcolumn
0  10  20        200
1  20  30        600
2  30  10        300
Roman Pekar
  • 6
    This is a great tip, and it leaves the column references near the apply call (in it actually). I used this tip and the multi-column output tip @toto_tico supplied to generate a 3 column in, 4 column out function! Works great! – RufusVS Sep 11 '18 at 19:00
  • 21
    Wow, it seem that you're the only one not focussing on OP's bare minimal example but addresses the whole problem, thanks, exactly what I needed! :) – Matt Oct 19 '18 at 14:06
  • 5
    Indeed this should be the 'official' answer. – Fed Dec 14 '20 at 03:29
193

Alternatively, you can use the underlying NumPy function:

>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300

or, in the general case, vectorize an arbitrary function:

>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300
alko
  • 3
    Thanks for the answer! I am curious, is this the fastest solution? – MV23 Jun 25 '16 at 17:32
  • 15
    The vectorized version using `np.vectorize()` is amazingly fast. Thank you. – stackoverflowuser2010 Dec 18 '17 at 22:03
  • This is a useful solution. If the size of input arguments to the function x and y are not equal, you get an error. In that case, the @RomanPekar solution works without any problem. I didn't compare the performance. – eSadr Feb 15 '19 at 01:55
  • I know this is an old answer, but: I have an edge case, in which `np.vectorize` does not work. The reason is, that one of the columns is of the type `pandas._libs.tslibs.timestamps.Timestamp`, which gets turned into the type `numpy.datetime64` by the vectorization. The two types are not interchangeable, causing the function to behave badly. Any suggestions on this? (Other than `.apply` as this is apparently to be avoided) – ElRudi Jan 03 '20 at 09:52
  • 2
    Great solution! in case anyone is wondering vectorize works well and super fast for string comparison functions as well. – infiniteloop Apr 28 '20 at 11:24
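As a cross-check on the approaches discussed so far, here is a minimal sketch confirming that row-wise apply, np.vectorize and plain column arithmetic agree on the example frame. The NumPy documentation describes np.vectorize as a convenience wrapper whose implementation is essentially a for loop, so the speed reported in the comments is best read as relative to row-wise apply, not to true column arithmetic. The variable names (via_apply, via_vectorize, via_columns) are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def fxy(x, y):
    return x * y

# Row-wise apply: pandas builds a Series for each row and calls the lambda on it.
via_apply = df.apply(lambda row: fxy(row["A"], row["B"]), axis=1)

# np.vectorize: fxy is still called element by element in a Python-level loop,
# but no per-row Series is constructed. The result is a NumPy array.
via_vectorize = np.vectorize(fxy)(df["A"], df["B"])

# Column arithmetic: fxy only uses *, so it also accepts whole Series.
via_columns = fxy(df["A"], df["B"])

print(via_apply.tolist())
print(via_vectorize.tolist())
print(via_columns.tolist())
```

The third form only works when every operation inside the function is itself defined for Series; functions with branches or scalar-only calls still need one of the first two.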
59

This solves the problem:

df['newcolumn'] = df.A * df.B

You could also do:

def fab(row):
  return row['A'] * row['B']

df['newcolumn'] = df.apply(fab, axis=1)
greenafrican
  • 17
    This answer solves this toy example and will be enough for me to rewrite my actual function, but it does not address how to apply a previously defined function without rewriting it to reference columns. – Michael Nov 11 '13 at 21:03
  • 2
    Be aware that the vectorized operation (the first code sample) has a lot better performance than the code sample with `apply`. – Niels Bom Jun 03 '21 at 13:57
49

If you need to create multiple columns at once:

  1. Create the dataframe:

    import pandas as pd
    df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
    
  2. Create the function:

    def fab(row):                                                  
        return row['A'] * row['B'], row['A'] + row['B']
    
  3. Assign the new columns:

    df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))
    
toto_tico
  • 1
    I was wondering how I could generate multiple columns with one apply! I used this with @Roman Pekar's answer to generate a 3 column in, 4 column out function! Works great! – RufusVS Sep 11 '18 at 19:01
  • Would you please explain what does `zip` do here? Thanks! – Mostafa Ghadimi Sep 24 '20 at 00:10
  • `zip` iterates *simultaneously* several iterables (e.g. lists, iterators). `*df.apply` will yield N (N=`len(df)`) iterables, each iterable with 2 elements; `zip` will iterate over the N rows simultaneously, so that it instead yields 2 iterables of N elements. You can test this, e.g. `zip(['a','b'],['c','d'],['e','f'])` will yield `[('a', 'c', 'e'), ('b', 'd', 'f')]` (basically, the transpose). Note that I am intentionally using the word `yield`, as opposed to `return`, because we are talking about iterators (so, transform the zip result into a list: `list(zip(['a','b'],['c','d'],['e','f']))`) – toto_tico Sep 25 '20 at 07:36
  • Alternatively use `result_type='expand'`: `df[['col1', 'col2']] = df.apply(fab, axis=1, result_type='expand')` – fantabolous May 19 '23 at 08:26
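To make the two multi-column variants from this answer and its comments concrete, the sketch below runs both on the same frame: the zip(*...) transpose and the result_type='expand' form. The output column names (product, total, product2, total2) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def fab(row):
    return row["A"] * row["B"], row["A"] + row["B"]

# With axis=1 and a tuple return value, apply gives a Series of tuples,
# one (product, total) pair per row.
pairs = df.apply(fab, axis=1)

# zip(*pairs) transposes N pairs into 2 tuples of N values each.
products, totals = zip(*pairs)
df["product"] = list(products)
df["total"] = list(totals)

# result_type="expand" asks pandas to spread each tuple across columns instead.
df[["product2", "total2"]] = df.apply(fab, axis=1, result_type="expand")

print(df)
```

Both forms call fab once per row; they differ only in how the returned tuples are turned into columns.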
18

Another option, using clean dict-style column access:

df["new_column"] = df.apply(lambda x: x["A"] * x["B"], axis = 1)

or,

df["new_column"] = df["A"] * df["B"]
Surya
4

This gives the desired result dynamically, and it works even if the function takes more than two arguments:

df['anothercolumn'] = df[['A', 'B']].apply(lambda x: fxy(*x), axis=1)
print(df)


    A   B  newcolumn  anothercolumn
0  10  20        100            200
1  20  30        400            600
2  30  10        900            300
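A minimal sketch of the "more than two arguments" case, assuming a hypothetical third column C and a hypothetical three-argument function fxyz (neither appears in the question). The values in each row are unpacked positionally, so the order of the selected columns has to match the order of the function's parameters.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [1, 2, 3]})

# Hypothetical three-argument function, for illustration only.
def fxyz(x, y, z):
    return x * y + z

# Select the columns in the order fxyz expects, then unpack each row.
df["result"] = df[["A", "B", "C"]].apply(lambda row: fxyz(*row), axis=1)

print(df["result"].tolist())
```

Reordering the selection, for example df[["C", "A", "B"]], would silently pass the values to different parameters, so keeping the column list next to the call helps readability.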
Babatunde Mustapha
1

The other answers focus on functions that take the dataframe's columns as inputs. More generally, if you want to use pandas .apply with a function that takes multiple arguments, some of which may not be columns, you can pass them as keyword arguments in the .apply() call:

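def fxy(x, y):
    return x * y

# Extra keyword arguments are forwarded unchanged to every call,
# so this form suits constants.
df['newcolumn1'] = df.A.apply(fxy, y=4)
# Passing y=df.B would hand the whole Series to each call,
# so pair a second column row by row instead.
df['newcolumn'] = df.apply(lambda row: fxy(row['A'], y=row['B']), axis=1)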
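As a small illustration of how Series.apply forwards extra arguments, the sketch below passes the same constant once by keyword and once positionally through the args parameter. The result names (scaled_kw, scaled_pos) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def fxy(x, y):
    return x * y

# Keyword form: y=4 is forwarded to every call as fxy(value, y=4).
scaled_kw = df["A"].apply(fxy, y=4)

# Positional form: args must be a tuple, hence the trailing comma.
scaled_pos = df["A"].apply(fxy, args=(4,))

print(scaled_kw.tolist())
print(scaled_pos.tolist())
```

In both forms the extra argument is the same object on every call, which is why this pattern fits constants and not a second column.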
Luca Clissa