How do I reference the dataframe to which a function is being applied, from inside that function?

For example, I have a dataframe named name_df. It has 4 columns (no specified index).

I have a function called calculate_stats that takes in several arguments (mixture of integer values and a df).

Inside calculate_stats I want to refer to name_df['name1'] and name_df['name2']

I did:

name_df.apply(calculate_stats, axis=1, args=(r, df,x,y,z))

And inside calculate_stats I use r['name1'] and r['name2'].

But I got the error NameError: name 'r' is not defined

In the following link, they apply a function func1 to a dataframe df. The argument that references each row of df is named r, so inside func1 the columns of df can be referred to as r['colname']. How do I do the same with my function?

In [37]: df
Out[37]:
   X  Y  Count
0  0  1      2
1  0  1      2
2  1  1      2
3  1  0      1
4  1  1      2
5  0  0      1

In [38]: def func1(r):
   ....:     print(r['X'])
   ....:     print(r['Y'])
   ....:     return r
   ....:
codingknob
  • The current row will always be the first argument passed to the function, and the arguments in `args` will be passed after. – IanS Apr 07 '16 at 16:01

2 Answers

2

The current row will always be the first argument passed to the function, and the arguments in args will be passed after it, in order. You should not include r in args yourself: pandas supplies it.

If I understand correctly what you are trying to do, this should work:

name_df.apply(calculate_stats, axis=1, args=(df, x, y, z))

This will call calculate_stats(r, df, x, y, z) once per row, where r is the current row of the dataframe that the function is being applied to.
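To make the argument order concrete, here is a minimal sketch. Only the names name_df, calculate_stats, name1, name2, df, x, y and z come from the question; the data, the 'value' column and the body of calculate_stats are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's data; only the column names
# 'name1' and 'name2' come from the question.
name_df = pd.DataFrame({"name1": ["a", "b"], "name2": ["c", "d"]})
df = pd.DataFrame({"value": [10, 20, 30]})  # 'value' is an assumed column
x, y, z = 1, 2, 3

def calculate_stats(r, df, x, y, z):
    # r is the current row of name_df (a pandas Series); pandas passes it
    # as the first positional argument. df, x, y, z come from `args`.
    total = df["value"].sum() + x + y + z
    return f"{r['name1']}-{r['name2']}:{total}"

# Note that r does not appear in args.
result = name_df.apply(calculate_stats, axis=1, args=(df, x, y, z))
print(result.tolist())
```

Because the function returns one scalar per row here, result is a Series with one entry per row of name_df.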

IanS
  • Yes, this is exactly what I want to do. Your suggestion fixed that problem. However, now I've run into another problem. Is it possible to return multiple dataframes from a df.apply() operation? Or do I need to do something like df1, df2 = name_df.apply(calculate_stats, axis=1, args=(df, x, y, z)), which produced an error? – codingknob Apr 07 '16 at 16:10
  • This is the error: ValueError: need more than 1 value to unpack – codingknob Apr 07 '16 at 16:19
  • I don't think that's possible. You could have `apply` return a dataframe `df` with multiple columns, and then do for instance `df1 = df['col1']` and `df2 = df['col2']`. – IanS Apr 07 '16 at 17:21
  • From the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html), `apply` returns a Series or a DataFrame. You can't return two dataframes. – IanS Apr 07 '16 at 17:40
  • Gotcha. So I changed calculate_stats() to produce one dataframe. However, I now get the following error: ValueError: cannot copy sequence with size 45 to array axis with dimension 6. – codingknob Apr 07 '16 at 17:42
  • If I run calculate_stats() by itself by passing in the first row of name_df then everything works great. I get the result df. However, it doesn't work when used via name_df.apply() – codingknob Apr 07 '16 at 17:43
  • Are you perhaps running into [this issue](http://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns)? If `calculate_stats` returns multiple values, you need to transform them into a pandas Series in order for `apply` to return a dataframe with multiple columns. – IanS Apr 07 '16 at 17:58
  • Here is the relevant documentation: http://pandas-docs.github.io/pandas-docs-travis/groupby.html#flexible-apply – IanS Apr 07 '16 at 18:04
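As a rough illustration of the suggestion in the comments above (return a pandas Series from the function so that apply builds a multi-column dataframe, then split it by column), here is a small sketch. The numeric data and the 'total' and 'product' statistics are hypothetical; the real calculate_stats is not shown in the question.

```python
import pandas as pd

# Hypothetical numeric data; the column names are taken from the question.
name_df = pd.DataFrame({"name1": [1, 2], "name2": [10, 20]})

def calculate_stats(r):
    # Returning a Series makes apply(axis=1) produce a DataFrame whose
    # columns are the Series labels ('total' and 'product' are made up).
    return pd.Series({
        "total": r["name1"] + r["name2"],
        "product": r["name1"] * r["name2"],
    })

stats = name_df.apply(calculate_stats, axis=1)

# Split the single result by column instead of unpacking the apply() call.
df1 = stats["total"]
df2 = stats["product"]
print(stats)
```

This only covers the case where each row yields a fixed set of scalar values. If every row has to produce a whole dataframe of its own, an explicit loop over the rows may be a better fit than apply.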
0

Did you try using lambda like for instance:

 name_df['concat'] = name_df.apply(lambda x: x['name1'] + x['name2'], axis=1)

x would be the current row, passed as a pandas Series, which you can index by column name much like a dict.
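A minimal runnable sketch of this lambda approach, with made-up string data. It passes axis=1, since without it apply hands the lambda one column at a time rather than one row.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
name_df = pd.DataFrame({"name1": ["a", "b"], "name2": ["c", "d"]})

# axis=1 means the lambda receives one row at a time.
name_df["concat"] = name_df.apply(lambda x: x["name1"] + x["name2"], axis=1)

# Each row arrives as a pandas Series indexed by column name.
row_types = name_df.apply(lambda x: type(x).__name__, axis=1)
print(name_df)
```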

Till
  • I didn't use lambda because calculate_stats() is a complicated function. It performs many operations on input dataframe df and produces several data frames as a result. Basically I want to extract name_df['name1'] and name_df['name2'] and iterate over every row in name_df and perform operations for every name1 and name2 combination. – codingknob Apr 07 '16 at 16:04