I am using Python 2.7 and pandas 0.11.0.

I am trying to fill a DataFrame column using DataFrame.apply(func). The function func() is supposed to return a NumPy array with three elements for each row.

import pandas as pd
import numpy as np

df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
print(df)

              A         B         C
    0  0.910142  0.788300  0.114164
    1 -0.603282 -0.625895  2.843130
    2  1.823752 -0.091736 -0.107781
    3  0.447743 -0.163605  0.514052

The function used for testing:

def test(row):
    # some complex calc here
    # based on the values from different columns
    return np.array((1, 2, 3))

df['D'] = df.apply(test, axis=1)

[...]
ValueError: Wrong number of items passed 1, indices imply 3

The funny thing is that when I create the DataFrame from scratch, it works fine and gives the expected result:

dic = {'A': {0: 0.9, 1: -0.6, 2: 1.8, 3: 0.4}, 
     'C': {0: 0.1, 1: 2.8, 2: -0.1, 3: 0.5}, 
     'B': {0: 0.7, 1: -0.6, 2: -0.1, 3: -0.1},
     'D': {0:np.array((1,2,3)), 
          1:np.array((1,2,3)), 
          2:np.array((1,2,3)), 
          3:np.array((1,2,3))}}

df= pd.DataFrame(dic)
print(df)
         A    B    C          D
    0  0.9  0.7  0.1  [1, 2, 3]
    1 -0.6 -0.6  2.8  [1, 2, 3]
    2  1.8 -0.1 -0.1  [1, 2, 3]
    3  0.4 -0.1  0.5  [1, 2, 3]

Thanks in advance

Nic
  • You should avoid using `list`s/`tuple`s in `DataFrame`s or `Series`. Why not just have 3 columns in `df` or a separate `DataFrame` with your columns? – Phillip Cloud Sep 05 '13 at 16:49
  • I guess sometimes vector form is more natural for some quantities, e.g., coordinates. `df.endPoint-df.startPoint` is obviously preferable to `np.c_[df.endX-df.startX, df.endY-df.startY, df.endZ-df.startZ]`. – herrlich10 Oct 29 '13 at 05:36

1 Answer

If the function passed to apply returns multiple values, and the DataFrame you call apply on has the same number of items along the axis (in this case, columns) as the number of values returned, pandas creates a DataFrame from the return values with the same labels as the original DataFrame. You can see this if you run:

>>> def test(row):
...     return [1, 2, 3]
...
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3

That is why you get the error: you cannot assign a DataFrame to a single DataFrame column.

If the function returns any other number of values, apply returns a plain Series object, which can be assigned:

>>> def test(row):
...     return [1, 2]
...
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
3    [1, 2]
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C       D
0  0.333535  0.209745 -0.972413  [1, 2]
1  0.469590  0.107491 -1.248670  [1, 2]
2  0.234444  0.093290 -0.853348  [1, 2]
3  1.021356  0.092704 -0.406727  [1, 2]

I'm not sure why pandas does this, or why it does so only when the return value is a list or an ndarray; it does not happen if you return a tuple:

>>> def test(row):
...     return (1, 2, 3)
...
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C          D
0  0.121136  0.541198 -0.281972  (1, 2, 3)
1  0.569091  0.944344  0.861057  (1, 2, 3)
2 -1.742484 -0.077317  0.181656  (1, 2, 3)
3 -1.541244  0.174428  0.660123  (1, 2, 3)
Viktor Kerkez
  • Hi Viktor! Thanks for the answer. So if I understand you correctly, there is no way to pass a NumPy array? – Nic Sep 05 '13 at 16:54
  • @Nic If the length of the NumPy array is not the same as the number of columns, your code will work, but it's not intended to be used that way. As Phillip Cloud said, you should avoid placing lists or arrays in your Series. You should create multiple Series (that is, multiple columns in your DataFrame). – Viktor Kerkez Sep 05 '13 at 16:59
  • Thanks guys. I'll follow your advice then and go for 3 columns. @Phillip: sorry, I missed your comment on first reading. – Nic Sep 05 '13 at 17:06
  • I want to keep some arrays in the same DataFrame; I wish there were a supported way to do this. – dashesy Oct 14 '14 at 20:29
  • Is there any alternative to pandas that would work? I don't understand the point of not letting users choose what objects they want to put inside a DataFrame. – agemO Mar 07 '18 at 09:38
  • @agemO Supposedly you can still get NumPy arrays by doing `df.apply(tuple_test, axis=1).apply(np.array)`, but I cannot get this to work for me yet. [See this SO thread](https://stackoverflow.com/q/45548426/3670871). – Engineero Mar 15 '18 at 21:50
  • An efficient but dirty solution is to write a decorator that puts any function output into a tuple, then turns the tuple back into the original output once the operations (apply, agg) are done. This is what I now do on a regular basis. – agemO Mar 16 '18 at 07:06