
I am applying a function to the rows of a DataFrame in pandas. The function returns four values per row, so in practice the object returned by apply is a Series of tuples. I want to add these four values as their own columns. I know that I can convert the output to a DataFrame and then concatenate it with the old DataFrame, like so:

import pandas as pd


def some_func(i):
    return i+1, i+2, i+3, i+4

df = pd.DataFrame(range(10), columns=['start'])
res = df.apply(lambda row: some_func(row['start']), axis=1)

# convert to df and add column names
res_df = res.apply(pd.Series)
res_df.columns = ['label_1', 'label_2', 'label_3', 'label_4']

# concatenate with old df
df = pd.concat([df, res_df], axis=1)
print(df)

My question is whether there is a better way to do this. In particular, the `res.apply(pd.Series)` step seems redundant, but I don't know a better alternative. Performance is an important factor for me.


As requested, an example input DataFrame could look like this:

   start
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9

And the expected output, with the four added columns:

   start  label_1  label_2  label_3  label_4
0      0        1        2        3        4
1      1        2        3        4        5
2      2        3        4        5        6
3      3        4        5        6        7
4      4        5        6        7        8
5      5        6        7        8        9
6      6        7        8        9       10
7      7        8        9       10       11
8      8        9       10       11       12
9      9       10       11       12       13
Bram Vanroy

4 Answers


Directly assigning the values to the DataFrame is faster than concatenating.

This is one way to do the assignment:

df = pd.DataFrame(range(10), columns=['start'])

df['label_1'], df['label_2'], df['label_3'], df['label_4'] = zip(*[some_func(x) for x in df['start']])

This is faster than `res.apply(pd.Series)`.

Refer to adding multiple columns to pandas simultaneously for more ways to assign multiple columns at once.
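
For comparison, here is a sketch of one more option in the same spirit, using DataFrame.assign (the labels list and the values name below are illustrative choices, not something from the linked question):

import pandas as pd

def some_func(i):
    return i + 1, i + 2, i + 3, i + 4

df = pd.DataFrame(range(10), columns=['start'])

# Unzip the per-row tuples into four column-wise sequences, then add all
# four columns in a single call; assign returns a new DataFrame.
labels = ['label_1', 'label_2', 'label_3', 'label_4']
values = zip(*(some_func(x) for x in df['start']))
df = df.assign(**dict(zip(labels, values)))
print(df)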

Keval Dave

Here are a couple of things to try: instead of calling apply twice, use a generator expression with the pandas.DataFrame constructor and the DataFrame.join method:

import pandas as pd

def some_func(i):
    return i+1, i+2, i+3, i+4

df = pd.DataFrame(range(10), columns=['start'])

df = df.join(pd.DataFrame(some_func(x) for x in df['start'])
             .rename(columns=lambda x: f'label_{x+1}'))

[out]

   start  label_1  label_2  label_3  label_4
0      0        1        2        3        4
1      1        2        3        4        5
2      2        3        4        5        6
3      3        4        5        6        7
4      4        5        6        7        8
5      5        6        7        8        9
6      6        7        8        9       10
7      7        8        9       10       11
8      8        9       10       11       12
9      9       10       11       12       13

Timings

This should offer a ~144x speedup.

Using df = pd.DataFrame(range(100000), columns=['start']) for testing.

original method:

22.2 s ± 246 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

this solution:

152 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Andrea's solution:

1.66 s ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Keval Dave's solution: <- fastest! ~230x speedup

95.2 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
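
For reference, a minimal sketch of how timings like these could be reproduced with the standard library's timeit module (the numbers above presumably came from IPython's %timeit; the wrapper function names below are mine, and absolute times will vary by machine):

import timeit

import pandas as pd

def some_func(i):
    return i + 1, i + 2, i + 3, i + 4

df = pd.DataFrame(range(100000), columns=['start'])

def original():
    # apply row-wise, expand the tuples with a second apply, then concat
    res = df.apply(lambda row: some_func(row['start']), axis=1)
    res_df = res.apply(pd.Series)
    res_df.columns = ['label_1', 'label_2', 'label_3', 'label_4']
    return pd.concat([df, res_df], axis=1)

def join_constructor():
    # generator + DataFrame constructor + join (this answer)
    return df.join(pd.DataFrame(some_func(x) for x in df['start'])
                   .rename(columns=lambda x: f'label_{x+1}'))

def zip_assign():
    # zip/unpack into direct column assignment (Keval Dave's answer)
    out = df.copy()
    out['label_1'], out['label_2'], out['label_3'], out['label_4'] = zip(
        *[some_func(x) for x in out['start']])
    return out

for name, func in [('original', original),
                   ('join + constructor', join_constructor),
                   ('zip assign', zip_assign)]:
    # number=1 keeps the slow original variant from dominating the run time
    print(f'{name:>20}: {timeit.timeit(func, number=1):.3f} s')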
Chris Adams
    Could you test mine as well? I suspect `join` might be faster – Andrea Feb 04 '20 at 11:30
    `1.61 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` - still a significant speedup :) – Chris Adams Feb 04 '20 at 11:33
  • Yep, indeed. I tested it as well. Btw it seems that @Keval Dave method is even faster – Andrea Feb 04 '20 at 11:35
    Not sure if it matters for timing, but you didn't test all implementations with the same number of loops. Otherwise, a great answer. Thanks! – Bram Vanroy Feb 04 '20 at 11:41
  • I notice something strange. If I test the timing using `%%time` (just one run), I get that @Chris A's version is consistently faster than @Keval Dave's solution. Can you replicate that behavior? – Andrea Feb 04 '20 at 11:46
  • I'm re-running with 10 loops for all solutions, will update timings - still seeing Keval's as the fastest though – Chris Adams Feb 04 '20 at 11:47
  • I'm actually quite surprised that a "non-pandas" approach (by Keval Dave) works fastest. I know it's not fair to call it a non-pandas approach, but I had expected some kind of pandas built-in that did this, but apparently a straightforward Python approach is even better. That's nice. – Bram Vanroy Feb 04 '20 at 13:18

If you want a one-liner, you can try:

df[['label_1', 'label_2', 'label_3', 'label_4']] = pd.DataFrame(
    df.apply(lambda row: some_func(row['start']), axis=1).tolist())

This seems to be a very fast method, as suggested here.
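
One caveat worth noting (an addition here, not part of the suggestion above): assigning a DataFrame to a column selection aligns on the row index, so if df does not use the default RangeIndex it may be safer to pass index=df.index (and the column names) to the constructor explicitly. A sketch with a hypothetical non-default index:

import pandas as pd

def some_func(i):
    return i + 1, i + 2, i + 3, i + 4

# Hypothetical non-default index, just to illustrate the alignment concern.
df = pd.DataFrame(range(10), columns=['start'], index=range(100, 110))

labels = ['label_1', 'label_2', 'label_3', 'label_4']
df[labels] = pd.DataFrame(
    df.apply(lambda row: some_func(row['start']), axis=1).tolist(),
    index=df.index,   # keep the new rows aligned with the original ones
    columns=labels,
)
print(df)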

Andrea

If you return a pd.Series from your function, then Pandas will turn its elements into columns of the resulting DataFrame when calling apply().

If you use an index for the Series, the elements in the index will become the names of the columns of the resulting DataFrame.

In your case:

res_index = pd.Index([
    'label_1',
    'label_2',
    'label_3',
    'label_4',
])
res_df = df.apply(
    lambda row: pd.Series(some_func(row['start']), index=res_index),
    axis=1,
)
df = pd.concat([df, res_df], axis=1)
print(df)

Perhaps a slightly cleaner way to build res_df is to apply the function to the 'start' series only:

res_df = df['start'].apply(
    lambda i: pd.Series(some_func(i), index=res_index),
)

If you're willing to rewrite your some_func to return a pd.Series directly:

def some_func(i, index=None):
    return pd.Series(
        [i+1, i+2, i+3, i+4],
        index=index,
    )

res_index = pd.Index([
    'label_1',
    'label_2',
    'label_3',
    'label_4',
])
res_df = df['start'].apply(some_func, index=res_index)
df = pd.concat([df, res_df], axis=1)
print(df)

All of the above will return the expected result:

   start  label_1  label_2  label_3  label_4
0      0        1        2        3        4
1      1        2        3        4        5
2      2        3        4        5        6
3      3        4        5        6        7
4      4        5        6        7        8
5      5        6        7        8        9
6      6        7        8        9       10
7      7        8        9       10       11
8      8        9       10       11       12
9      9       10       11       12       13
filbranden
    Looks like this won't be much faster than the original, since creating one Series per row will still be the bottleneck. Quite often `zip()` really takes the crown! – filbranden Feb 04 '20 at 12:40