9

I have been scouring SO for the best way of applying a function that takes multiple separate Pandas DataFrame columns and outputs multiple new columns in the same said DataFrame. Let's say I have the following:

def apply_func_to_df(df):
    df[['new_A', 'new_B']] = df.apply(lambda x: transform_func(x['A'], x['B'], x['C']), axis=1)

def transform_func(value_A, value_B, value_C):
    # do some processing and transformation and stuff
    return new_value_A, new_value_B

I am trying to apply this function as shown above to the whole DataFrame df in order to output 2 new columns. However, this should generalize to any use case where a function takes n DataFrame columns and outputs m new columns in the same DataFrame.

The following are things I have been looking at (with varying degrees of success):

  • Create a Pandas Series for the function call, then append to the existing DataFrame,
  • Zip the output columns (though my current implementation of this runs into some issues),
  • Re-write the basic function transform_func to explicitly expect rows (i.e. fields) A, B, C as follows, then do an apply to the df:

def transform_func_mod(df_row):
    # do something with df_row['A'], df_row['B'], df_row['C']
    return new_value_A, new_value_B

I would like a very general and Pythonic way to accomplish this task, while taking performance into account (both memory- and time-wise). I would appreciate any input on this, as I have been struggling with this due to my unfamiliarity with Pandas.

bharatk
qxzsilver
  • 1
    When you write "function that takes multiple separate Pandas DataFrame columns and outputs multiple new columns in the same said DataFrame", are you saying your function operates on columns and returns new columns (i.e. Series objects)? In other words, is your function already vectorized? – NicholasM Nov 11 '19 at 19:55
  • 1
    Not for this case. I am trying to figure out various tradeoffs (in terms of algorithmic complexity, memory, overhead etc.) of different methods, including apply, using zip, NumPy vs. Pandas vectorization, etc. and would like to choose the best approach for my usecase. – qxzsilver Nov 12 '19 at 02:42

2 Answers

7

Write your transform_func the following way:

  • it should have one parameter - the current row,
  • this function can read individual columns from the current row and make any use of them,
  • the returned object should be a Series with:
    • values - whatever you want to return,
    • index - target column names.

Example: Assuming that all 3 columns are of string type, concatenate columns A and B, and append the suffix "_xx" to C:

import pandas as pd

def transform_func(row):
    a, b, c = row.A, row.B, row.C
    # The index labels become the names of the new columns
    return pd.Series([a + b, c + '_xx'], index=['new_A', 'new_B'])

To get only the new values, apply this function to each row:

df.apply(transform_func, axis=1)

Note that the resulting DataFrame retains the index of the original rows (we will make use of this in a moment).

Or, if you want to add these new columns to your DataFrame, join df with the result of the above apply call and assign the joined result back to df:

df = df.join(df.apply(transform_func, axis=1))
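Put together, the steps above can be sketched as a small runnable script. The sample data is made up for illustration (the question does not show any), and it assumes all three columns hold strings:

```python
import pandas as pd

# Hypothetical sample data; all three columns hold strings
df = pd.DataFrame({
    'A': ['a1', 'a2'],
    'B': ['b1', 'b2'],
    'C': ['c1', 'c2'],
})

def transform_func(row):
    # Return a Series whose index labels name the new columns
    return pd.Series([row.A + row.B, row.C + '_xx'], index=['new_A', 'new_B'])

# DataFrame with columns new_A and new_B, indexed like df
new_cols = df.apply(transform_func, axis=1)

# join aligns the two frames on their shared row index
result = df.join(new_cols)
print(result)
```

Because the join aligns on the index, this keeps working when df has a non-default index, as long as the index labels are unique.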

Edit following the comment as of 03:36:34Z

Using zip is probably the slowest option. A row-based function should be quicker, and it is a more intuitive construction. Probably the quickest way is to write two vectorized expressions, one for each new column. In this case, something like:

df['new_A'] = df.A + df.B
df['new_B'] = df.C + '_xx'

In general, though, the question is whether a row-based function can be expressed as vectorized expressions (as I did above). If it cannot, you can fall back to applying a row-based function.

To compare how fast each solution is, use %timeit.
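As a rough sketch of such a comparison, the script below implements the three approaches discussed here (zip(*), a row-based apply, and vectorized expressions) and times them with the standard timeit module, which is usable outside IPython. The sample data, the helper names with_zip, with_apply and with_vectorized, and the repeat count are all arbitrary choices made for illustration:

```python
import timeit
import pandas as pd

# Hypothetical sample data; size and values are arbitrary
df = pd.DataFrame({
    'A': ['a'] * 1000,
    'B': ['b'] * 1000,
    'C': ['c'] * 1000,
})

def transform_func(a, b, c):
    # Plain Python function returning a tuple of two values
    return a + b, c + '_xx'

def with_zip(frame):
    out = frame.copy()
    # Call the function once per row, then use zip(*) to turn the
    # list of (new_A, new_B) pairs into one tuple per output column
    pairs = [transform_func(a, b, c)
             for a, b, c in zip(out['A'], out['B'], out['C'])]
    out['new_A'], out['new_B'] = zip(*pairs)
    return out

def with_apply(frame):
    # Row-based function returning a Series, joined back on the index
    new_cols = frame.apply(
        lambda row: pd.Series(transform_func(row.A, row.B, row.C),
                              index=['new_A', 'new_B']),
        axis=1,
    )
    return frame.join(new_cols)

def with_vectorized(frame):
    out = frame.copy()
    out['new_A'] = out.A + out.B
    out['new_B'] = out.C + '_xx'
    return out

for func in (with_zip, with_apply, with_vectorized):
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f'{func.__name__}: {seconds:.4f} s for 3 runs')
```

The timings depend on the data size, the dtypes and the pandas version, so it is worth running this on data shaped like your own rather than relying on a general ranking.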

Valdi_Bo
  • I'd also like to know what the tradeoffs are with the different potential ways to approach the problem. For example, what is the way to do this via zip(*) (if any)? How does this perform against writing a row-based function and using .apply? Are there vectorized methods that can improve algorithmic time complexity as well as memory? I know I can use the method aforementioned by you, but I want to try to choose the best usecase – qxzsilver Nov 12 '19 at 03:36
  • Thanks for the helpful hint. I'll take a look and see how Pandas apply is implemented to try to get an understanding of the algorithmic complexity, as well as how exactly zip and zip(*) work (I'm still not completely sure exactly how they work). – qxzsilver Nov 13 '19 at 14:43
4

The question seems somewhat related to this question. I referenced the comment made by @spen.smith on this answer in coming up with this.

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 4], [3, 5, 7]], columns=['A', 'B', 'C'])
print(df)

   A  B  C
0  1  2  3
1  2  3  4
2  3  5  7

Rather than modifying what the function returns, just write it as usual, returning a plain tuple:

def add_subtract(args):
    arg1, arg2 = args

    ret1 = arg1 + arg2
    ret2 = arg1 - arg2

    return ret1, ret2

Examine the output of apply. The option result_type='expand' returns the result as a DataFrame instead of as a Series of tuples.

print(df[['B', 'C']].apply(add_subtract, axis=1, result_type='expand'))

    0  1
0   5 -1
1   7 -1
2  12 -2
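To make the effect of result_type='expand' concrete, here is a small sketch that runs the same apply call with and without the option on the DataFrame above. The variable names as_tuples and as_frame are only for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 4], [3, 5, 7]], columns=['A', 'B', 'C'])

def add_subtract(args):
    arg1, arg2 = args
    return arg1 + arg2, arg1 - arg2

# Default result_type: one tuple per row, collected in a Series
as_tuples = df[['B', 'C']].apply(add_subtract, axis=1)

# result_type='expand': each tuple element becomes its own column
as_frame = df[['B', 'C']].apply(add_subtract, axis=1, result_type='expand')

print(type(as_tuples).__name__, type(as_frame).__name__)
```

The expanded frame gets integer column labels (0 and 1) because the function returns a plain tuple, which carries no names.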

We can then assign the columns of the apply output to two new columns by transposing the result and taking its values. The transpose is necessary because values returns a 2-D array whose first axis is the rows, so unpacking it would yield one row at a time, whereas we want one column at a time. So the final expression is:

df['D'], df['E'] = df[['B', 'C']].apply(add_subtract, axis=1, result_type='expand').transpose().values
print(df)

   A  B  C   D  E
0  1  2  3   5 -1
1  2  3  4   7 -1
2  3  5  7  12 -2
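As a possible alternative to transposing, recent pandas versions also accept a DataFrame on the right-hand side when assigning to a list of new column names. A minimal sketch, reusing the same data and function, and assuming a pandas version that supports result_type='expand' and this form of assignment:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 4], [3, 5, 7]], columns=['A', 'B', 'C'])

def add_subtract(args):
    arg1, arg2 = args
    return arg1 + arg2, arg1 - arg2

expanded = df[['B', 'C']].apply(add_subtract, axis=1, result_type='expand')

# Assign both result columns at once; the expanded columns 0 and 1
# are matched to 'D' and 'E' by position
df[['D', 'E']] = expanded
print(df)
```

This avoids the round trip through a transposed array, at the cost of relying on the list of new names being in the same order as the function's return values.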
Benedictanjw