
I am applying a function to the rows of a DataFrame in pandas. The function returns four values per row, so in practice the object returned by apply is a Series of tuples. I want to add these four values as their own columns. I know that I can convert the output to a DataFrame and then concatenate it with the old DataFrame, like so:

import pandas as pd


def some_func(i):
    return i+1, i+2, i+3, i+4

df = pd.DataFrame(range(10), columns=['start'])
res = df.apply(lambda row: some_func(row['start']), axis=1)

# convert to df and add column names
res_df = res.apply(pd.Series)
res_df.columns = ['label_1', 'label_2', 'label_3', 'label_4']

# concatenate with old df
df = pd.concat([df, res_df], axis=1)
print(df)

My question is whether there is a better way to do this. In particular, the `res.apply(pd.Series)` step seems redundant, but I don't know a better alternative. Performance is an important factor for me.


As requested, an example input DataFrame could look like this:

   start
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9

And the expected output, with the four added columns:

   start  label_1  label_2  label_3  label_4
0      0        1        2        3        4
1      1        2        3        4        5
2      2        3        4        5        6
3      3        4        5        6        7
4      4        5        6        7        8
5      5        6        7        8        9
6      6        7        8        9       10
7      7        8        9       10       11
8      8        9       10       11       12
9      9       10       11       12       13
Bram Vanroy

4 Answers


Directly assigning the values to the DataFrame is faster than concatenating.

This is one way to do the assignment:

df = pd.DataFrame(range(10), columns=['start'])

df['label_1'], df['label_2'], df['label_3'], df['label_4'] = zip(*[some_func(x) for x in df['start']])

This is faster than `res.apply(pd.Series)`.

Refer to adding multiple columns to pandas simultaneously for more ways to assign multiple columns at once.
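
For comparison, here is a sketch of one more option in the same spirit, using DataFrame.assign (the labels list and the values name below are illustrative choices, not something from the linked question):

import pandas as pd

def some_func(i):
    return i + 1, i + 2, i + 3, i + 4

df = pd.DataFrame(range(10), columns=['start'])

# Unzip the per-row tuples into four column-wise sequences, then add all
# four columns in a single call; assign returns a new DataFrame.
labels = ['label_1', 'label_2', 'label_3', 'label_4']
values = zip(*(some_func(x) for x in df['start']))
df = df.assign(**dict(zip(labels, values)))
print(df)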

Keval Dave

Here are a couple of things to try: instead of calling apply twice, use a generator expression with the pandas.DataFrame constructor and the DataFrame.join method:

import pandas as pd

def some_func(i):
    return i+1, i+2, i+3, i+4

df = pd.DataFrame(range(10), columns=['start'])

df = df.join(pd.DataFrame(some_func(x) for x in df['start'])
             .rename(columns=lambda x: f'label_{x+1}'))

[out]

   start  label_1  label_2  label_3  label_4
0      0        1        2        3        4
1      1        2        3        4        5
2      2        3        4        5        6
3      3        4        5        6        7
4      4        5        6        7        8
5      5        6        7        8        9
6      6        7        8        9       10
7      7        8        9       10       11
8      8        9       10       11       12
9      9       10       11       12       13

Timings

This should offer a ~144x speedup.

Using df = pd.DataFrame(range(100000), columns=['start']) for testing.

original method:

22.2 s ± 246 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

this solution:

152 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Andrea's solution:

1.66 s ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Keval Dave's solution: <- fastest! ~230x speedup

95.2 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
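
For reference, a minimal sketch of how timings like these could be reproduced with the standard library's timeit module (the numbers above presumably came from IPython's %timeit; the wrapper function names below are mine, and absolute times will vary by machine):

import timeit

import pandas as pd

def some_func(i):
    return i + 1, i + 2, i + 3, i + 4

df = pd.DataFrame(range(100000), columns=['start'])

def original():
    # apply row-wise, expand the tuples with a second apply, then concat
    res = df.apply(lambda row: some_func(row['start']), axis=1)
    res_df = res.apply(pd.Series)
    res_df.columns = ['label_1', 'label_2', 'label_3', 'label_4']
    return pd.concat([df, res_df], axis=1)

def join_constructor():
    # generator + DataFrame constructor + join (this answer)
    return df.join(pd.DataFrame(some_func(x) for x in df['start'])
                   .rename(columns=lambda x: f'label_{x+1}'))

def zip_assign():
    # zip/unpack into direct column assignment (Keval Dave's answer)
    out = df.copy()
    out['label_1'], out['label_2'], out['label_3'], out['label_4'] = zip(
        *[some_func(x) for x in out['start']])
    return out

for name, func in [('original', original),
                   ('join + constructor', join_constructor),
                   ('zip assign', zip_assign)]:
    # number=1 keeps the slow original variant from dominating the run time
    print(f'{name:>20}: {timeit.timeit(func, number=1):.3f} s')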
Chris Adams
    Could you test mine as well? I suspect `join` might be faster – Andrea Feb 04 '20 at 11:30
    `1.61 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` - still a significant speedup :) – Chris Adams Feb 04 '20 at 11:33
  • Yep, indeed. I tested it as well. Btw it seems that @Keval Dave method is even faster – Andrea Feb 04 '20 at 11:35
    Not sure if it matters for timing, but you didn't test all implementations with the same number of loops. Otherwise, a great answer. Thanks! – Bram Vanroy Feb 04 '20 at 11:41
  • I notice something strange. If I test the timing using `%%time` (just one run), I get that @Chris A's version is consistently faster than @Keval Dave's solution. Can you replicate that behavior? – Andrea Feb 04 '20 at 11:46
  • I'm re-running with 10 loops for all solutions, will update timings - still seeing Keval's as the fastest though – Chris Adams Feb 04 '20 at 11:47
  • I'm actually quite surprised that a "non-pandas" approach (by Keval Dave) works fastest. I know it's not fair to call it a non-pandas approach, but I had expected some kind of pandas built-in that did this, but apparently a straightforward Python approach is even better. That's nice. – Bram Vanroy Feb 04 '20 at 13:18

If you want a one-liner, you can try:

df[['label_1', 'label_2', 'label_3', 'label_4']] = pd.DataFrame(
    df.apply(lambda row: some_func(row['start']), axis=1).tolist())

This seems to be a very fast method, as suggested here.
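
One caveat worth noting (an addition here, not part of the suggestion above): assigning a DataFrame to a column selection aligns on the row index, so if df does not use the default RangeIndex it may be safer to pass index=df.index (and the column names) to the constructor explicitly. A sketch with a hypothetical non-default index:

import pandas as pd

def some_func(i):
    return i + 1, i + 2, i + 3, i + 4

# Hypothetical non-default index, just to illustrate the alignment concern.
df = pd.DataFrame(range(10), columns=['start'], index=range(100, 110))

labels = ['label_1', 'label_2', 'label_3', 'label_4']
df[labels] = pd.DataFrame(
    df.apply(lambda row: some_func(row['start']), axis=1).tolist(),
    index=df.index,   # keep the new rows aligned with the original ones
    columns=labels,
)
print(df)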

Andrea

If you return a pd.Series from your function, then Pandas will turn its elements into columns of the resulting DataFrame when calling apply().

If you use an index for the Series, the elements in the index will become the names of the columns of the resulting DataFrame.

In your case:

res_index = pd.Index([
    'label_1',
    'label_2',
    'label_3',
    'label_4',
])
res_df = df.apply(
    lambda row: pd.Series(some_func(row['start']), index=res_index),
    axis=1,
)
df = pd.concat([df, res_df], axis=1)
print(df)

Perhaps a slightly cleaner way to build res_df is to apply the function to the 'start' series only:

res_df = df['start'].apply(
    lambda i: pd.Series(some_func(i), index=res_index),
)

If you're willing to rewrite your some_func to return a pd.Series directly:

def some_func(i, index=None):
    return pd.Series(
        [i+1, i+2, i+3, i+4],
        index=index,
    )

res_index = pd.Index([
    'label_1',
    'label_2',
    'label_3',
    'label_4',
])
res_df = df['start'].apply(some_func, index=res_index)
df = pd.concat([df, res_df], axis=1)
print(df)

All of the above will return the expected result:

   start  label_1  label_2  label_3  label_4
0      0        1        2        3        4
1      1        2        3        4        5
2      2        3        4        5        6
3      3        4        5        6        7
4      4        5        6        7        8
5      5        6        7        8        9
6      6        7        8        9       10
7      7        8        9       10       11
8      8        9       10       11       12
9      9       10       11       12       13
filbranden
    Looks like this won't be much faster than the original, since creating one Series per row will still be the bottleneck. Quite often `zip()` really takes the crown! – filbranden Feb 04 '20 at 12:40