96

I have a dataframe with a timeindex and 3 columns containing the coordinates of a 3D vector:

                         x             y             z
ts
2014-05-15 10:38         0.120117      0.987305      0.116211
2014-05-15 10:39         0.117188      0.984375      0.122070
2014-05-15 10:40         0.119141      0.987305      0.119141
2014-05-15 10:41         0.116211      0.984375      0.120117
2014-05-15 10:42         0.119141      0.983398      0.118164

I would like to apply a transformation to each row that also returns a vector

def myfunc(a, b, c):
    do something
    return e, f, g

but if I do:

df.apply(myfunc, axis=1)

I end up with a Pandas series whose elements are tuples. This is beacause apply will take the result of myfunc without unpacking it. How can I change myfunc so that I obtain a new df with 3 columns?

Edit:

All solutions below work. The Series solution does allow for column names, the List solution seem to execute faster.

def myfunc1(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return pd.Series([e,f,g], index=['a', 'b', 'c'])

def myfunc2(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return [e,f,g]

%timeit df.apply(myfunc1 ,axis=1)

100 loops, best of 3: 4.51 ms per loop

%timeit df.apply(myfunc2 ,axis=1)

100 loops, best of 3: 2.75 ms per loop
smci
  • 32,567
  • 20
  • 113
  • 146
Fra
  • 4,918
  • 7
  • 33
  • 50
  • 1
    It's useful to say *"unpack the tuple(/list) returned by the function into multiple columns"*. Instead of *"This is beacause apply will take the result of myfunc without unpacking it. How can I change myfunc so that I obtain a new df with 3 columns?"** Tagged [tag:tuple-unpacking]/iterable-unpacking – smci May 24 '20 at 20:00

7 Answers7

92

Return Series and it will put them in a DataFrame.

def myfunc(a, b, c):
    do something
    return pd.Series([e, f, g])

This has the bonus that you can give labels to each of the resulting columns. If you return a DataFrame it just inserts multiple rows for the group.

U2EF1
  • 12,907
  • 3
  • 35
  • 37
  • see more examples at [flexible apply](http://pandas-docs.github.io/pandas-docs-travis/groupby.html#flexible-apply) – smile-on Feb 01 '16 at 17:20
  • 8
    The series answer seems to be the canonical one. However, on version 0.18.1 the series solution takes about 4x longer than running apply multiple times. – Kaushik Ghose Jun 16 '16 at 12:33
  • 4
    Wouldn't it be super inefficient to create an entire `pd.Series` on each iteration? – Marses Dec 17 '19 at 21:04
  • I get a "AttributeError: 'float' object has no attribute 'index'" when trying this approach, but not sure why its trying to get the index from one of the vales (float)? (Edit) Issue was I had two return statements, one just had Nan, that needed to be wrapped in a pd.Series() as well. – nicway Apr 09 '20 at 13:35
  • 4
    to add a bit to this nice answer, one can further do `new_vars = ['e', 'f', 'g']` and `df[new_vars] = df.apply(my_func, axis=1)` – Quetzalcoatl Dec 05 '20 at 03:23
44

Based on the excellent answer by @U2EF1, I've created a handy function that applies a specified function that returns tuples to a dataframe field, and expands the result back to the dataframe.

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)

Usage:

df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['A'])
print df
   A
a  1
b  2
c  3

def func(x):
    return x*x, x*x*x

print apply_and_concat(df, 'A', func, ['x^2', 'x^3'])

   A  x^2  x^3
a  1    1    1
b  2    4    8
c  3    9   27

Hope it helps someone.

Dennis Golomazov
  • 16,269
  • 5
  • 73
  • 81
25

Some of the other people's answers contain mistakes, so I've summarized them below. The perfect answer is below.

Prepare the dataset. The version of pandas uses 1.1.5.

import numpy as np
import pandas as pd
import timeit

# check pandas version
print(pd.__version__)
# 1.1.5

# prepare DataFrame
df = pd.DataFrame({
    'x': [0.120117, 0.117188, 0.119141, 0.116211, 0.119141],
    'y': [0.987305, 0.984375, 0.987305, 0.984375, 0.983398],
    'z': [0.116211, 0.122070, 0.119141, 0.120117, 0.118164]},
    index=[
        '2014-05-15 10:38',
        '2014-05-15 10:39',
        '2014-05-15 10:40',
        '2014-05-15 10:41',
        '2014-05-15 10:42'],
    columns=['x', 'y', 'z'])
df.index.name = 'ts'
#                          x         y         z
# ts                                            
# 2014-05-15 10:38  0.120117  0.987305  0.116211
# 2014-05-15 10:39  0.117188  0.984375  0.122070
# 2014-05-15 10:40  0.119141  0.987305  0.119141
# 2014-05-15 10:41  0.116211  0.984375  0.120117
# 2014-05-15 10:42  0.119141  0.983398  0.118164

Solution 01.

Returns pd.Series in the apply function.

def myfunc1(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return pd.Series([e, f, g])

df[['e', 'f', 'g']] = df.apply(myfunc1, axis=1)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t1 = timeit.timeit(
    'df.apply(myfunc1, axis=1)',
    globals=dict(df=df, myfunc1=myfunc1), number=10000)
print(round(t1, 3), 'seconds')
# 14.571 seconds

Solution 02.

Use result_type ='expand' when applying.

def myfunc2(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return [e, f, g]

df[['e', 'f', 'g']] = df.apply(myfunc2, axis=1, result_type='expand')
#                          x         y         z         e         f         g
# ts                                                                          
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t2 = timeit.timeit(
    "df.apply(myfunc2, axis=1, result_type='expand')",
    globals=dict(df=df, myfunc2=myfunc2), number=10000)
print(round(t2, 3), 'seconds')
# 9.907 seconds

Solution 03.

If you want to make it faster, use np.vectorize. Note that args cannot be a single argument when using np.vectorize.

def myfunc3(args0, args1, args2):
    e = args0 + 2*args1
    f = args1*args2 + 1
    g = args2 + args0 * args1
    return [e, f, g]

df[['e', 'f', 'g']] = pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)
#                          x         y         z         e         f         g
# ts                                                                          
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t3 = timeit.timeit(
    "pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)",
    globals=dict(pd=pd, np=np, df=df, myfunc3=myfunc3), number=10000)
print(round(t3, 3), 'seconds')
# 1.598 seconds
Keiku
  • 8,205
  • 4
  • 41
  • 44
20

I've tried returning a tuple (I was using functions like scipy.stats.pearsonr which return that kind of structures) but It returned a 1D Series instead of a Dataframe which was I expected. If I created a Series manually the performance was worse, so I fixed It using the result_type as explained in the official API documentation:

Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.

So you could edit your code this way:

def myfunc(a, b, c):
    # do something
    return (e, f, g)

df.apply(myfunc, axis=1, result_type='expand')
Genarito
  • 3,027
  • 5
  • 27
  • 53
  • 1
    I like this one, and it seems the most pandaic, while only compatible with pandas >= 0.0.23 ([per Genarito's link to the api documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)) – spen.smith Feb 23 '20 at 22:28
  • 9
    if you wish to create two or three (or n) new columns in your dataframe, you can use: `df['e'], d['f'], d['g'] = df.apply(myfunc, axis=1, result_type='expand').T.values` – spen.smith Feb 23 '20 at 23:27
  • Can we use .apply to return more number of rows than present at df to create a diluted copy? Lets say df had 100 rows and function return 100 rows for each row and resultant dataframe should have 100*100 rows. Possible? – vevek seetharaman Apr 30 '20 at 14:24
  • Sincerely, I don't know. Perhaps the best you can do is another Stack Overflow question to have a custom best answer – Genarito Apr 30 '20 at 18:02
  • I had to use df['e'], d['f'], d['g'] = df.apply(myfunc, axis=1, result_type='expand').T.values as suggested by @spen.smith. Without that, the values of directly assigning columns was 0 and 1 (e.g. df["A"], df["B"] = df.apply(foo, axis=1, result_type="expand") with foo returning ["A", "B"] or ("A","B") would give the columns A and B the values 0 and 1, respectively. – chooks Nov 21 '20 at 21:06
  • Thank you! Works perfectly! – zenalc May 12 '21 at 20:52
13

Just return a list instead of tuple.

In [81]: df
Out[81]: 
                            x         y         z
ts                                               
2014-05-15 10:38:00  0.120117  0.987305  0.116211
2014-05-15 10:39:00  0.117188  0.984375  0.122070
2014-05-15 10:40:00  0.119141  0.987305  0.119141
2014-05-15 10:41:00  0.116211  0.984375  0.120117
2014-05-15 10:42:00  0.119141  0.983398  0.118164

[5 rows x 3 columns]

In [82]: def myfunc(args):
   ....:        e=args[0] + 2*args[1]
   ....:        f=args[1]*args[2] +1
   ....:        g=args[2] + args[0] * args[1]
   ....:        return [e,f,g]
   ....: 

In [83]: df.apply(myfunc ,axis=1)
Out[83]: 
                            x         y         z
ts                                               
2014-05-15 10:38:00  2.094727  1.114736  0.234803
2014-05-15 10:39:00  2.085938  1.120163  0.237427
2014-05-15 10:40:00  2.093751  1.117629  0.236770
2014-05-15 10:41:00  2.084961  1.118240  0.234512
2014-05-15 10:42:00  2.085937  1.116202  0.235327
Happy001
  • 6,103
  • 2
  • 23
  • 16
11

Found a possible solution, by changing myfunc to return an np.array like this:

import numpy as np

def myfunc(a, b, c):
    do something
    return np.array((e, f, g))

any better solution?

Fra
  • 4,918
  • 7
  • 33
  • 50
  • 1
    Returning numpy array seems the best in terms of performance. For 100K rows, returning numpy array to get DataFrame columns takes 1.55 seconds; using return Series takes 39.7 seconds. Significant difference in performance here – Praveen Dec 16 '19 at 03:41
5

Pandas 1.0.5 has DataFrame.apply with parameter result_type that can help here. from the docs:

These only act when axis=1 (columns):

‘expand’ : list-like results will be turned into columns.

 ‘reduce’ : returns a Series if possible rather than expanding list-like results. This 
 is the opposite of ‘expand’.

‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the 
original index and columns will be retained.
Rachel Shalom
  • 399
  • 2
  • 10