I have the following DataFrame in Pandas:

import pandas as pd
import numpy as np

df = pd.DataFrame([(1, 1, 1, 0),
                   (2, 0, 0, 2),
                   (3, 0, 1, 3),
                   (4, 5, 3, 0)],
                  columns=list('abcd'))

I need to apply the following function to each column of that DataFrame:

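[Image of the formula not preserved. Judging from the code below, each value greater than mean + 2*std of its column is replaced by mean + 2*std, and all other values are kept.]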

I'm trying to use the apply() function below:

dfs = df.apply(lambda x: np.mean(x)+2*np.std(x) if x > np.mean(x)+2*np.std(x) else x, axis = 0, result_type='broadcast')
dfs

I'm getting the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm not really sure what it means, or where I should use a.empty, a.bool(), etc. to fix it.

Murilo
  • This happens because the `x` in your lambda is a whole column (a Series), so `x > np.mean(x)+2*np.std(x)` is a Series too. Using a Series as the condition of an `if` raises this error. See [this](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) for more explanation of the different cases in which this error occurs. – Ben.T Nov 10 '21 at 13:35
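
To make the comment above concrete, here is a minimal sketch of the failure and of an element-wise alternative. The sample values and the names `col`, `upper`, `mask`, and `capped` are invented for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one clear outlier (not the question's data).
col = pd.Series([1, 2, 3, 4, 2, 3, 1, 2, 3, 50], name="a")

# Upper bound: mean + 2 * population standard deviation (ddof=0, as np.std uses).
upper = col.mean() + 2 * col.std(ddof=0)

# The comparison is element-wise, so the result is a boolean Series,
# with one True/False per row rather than a single truth value.
mask = col > upper

# Using that Series as an `if` condition is what raises the ValueError.
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)

# Element-wise alternative: keep values where the mask is False,
# and substitute the bound where it is True.
capped = col.where(~mask, upper)
```

Here `a.any()` or `a.all()` would silence the error, but they collapse the mask to one boolean for the whole column, which is probably not what the formula intends.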

2 Answers

If you want to check the condition element by element, you can use np.where instead of if/else. The first parameter is the condition. Where it is true, the result takes the value of the second parameter at the same index; where it is false, it takes the value of the third parameter at the same index.

df.apply(lambda x: np.where(x > np.mean(x)+2*np.std(x), np.mean(x)+2*np.std(x), x), axis=0)
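
As a rough sketch of how this behaves on data where the bound is actually exceeded, the example below applies the same np.where idea to a made-up ten-row frame. The helper name `cap_column` and the data are assumptions for illustration. Wrapping the result in a Series is an extra step, not part of the one-liner above, so that the original index is kept explicitly.

```python
import numpy as np
import pandas as pd

def cap_column(col):
    """Replace values above mean + 2*std (population std) with that bound."""
    values = col.to_numpy()
    upper = np.mean(values) + 2 * np.std(values)  # np.std defaults to ddof=0
    # np.where picks `upper` where the condition holds, else the original value.
    return pd.Series(np.where(values > upper, upper, values), index=col.index)

# Hypothetical data: column "a" has one outlier, column "b" has none.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 2, 3, 1, 2, 3, 50],
    "b": [0, 1, 0, 2, 1, 0, 1, 2, 1, 1],
})

capped = df.apply(cap_column, axis=0)
print(capped)
```

Only the outlier in column `a` should change; column `b` keeps its values, although its dtype becomes float because the bound is a float.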
alparslan mimaroğlu

You can use clip after calculating the mean and std of every column of the DataFrame at once.

df.clip(upper=df.mean()+2*df.std(), axis=1)

With the current input, it does not change anything. Here is a way to see it:

# calculate the current upper bound
_upper = df.mean() + 2*df.std()
print(_upper)
# a    5.081989
# b    6.260952
# c    3.766611
# d    4.250000
# dtype: float64

# then overwrite two values with ones above the bound
df.loc[2,['a','b']] = [12,9]
print(df)
#     a  b  c  d
# 0   1  1  1  0
# 1   2  0  0  2
# 2  12  9  1  3 # see the values in column a and b
# 3   4  5  3  0

# see what clip does for the values in column a and b, index 2
print(df.clip(upper=_upper, axis=1))
#           a         b  c  d
# 0  1.000000  1.000000  1  0
# 1  2.000000  0.000000  0  2
# 2  5.081989  6.260952  1  3  # 12 and 9 replaced by the upper bound of the column
# 3  4.000000  5.000000  3  0
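
One detail worth checking when comparing this with an np.std-based approach: `DataFrame.std()` defaults to the sample standard deviation (ddof=1), while `np.std` defaults to the population one (ddof=0), so the two bounds differ slightly. The sketch below uses invented ten-row data, and the names `upper_sample`, `upper_population`, and `clipped` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has one outlier, column "b" has none.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 2, 3, 1, 2, 3, 50],
    "b": [0, 1, 0, 2, 1, 0, 1, 2, 1, 1],
})

# pandas default: sample standard deviation (ddof=1).
upper_sample = df.mean() + 2 * df.std()

# Matches np.std's default: population standard deviation (ddof=0).
upper_population = df.mean() + 2 * df.std(ddof=0)

# axis=1 aligns the per-column bounds with the DataFrame's columns.
clipped = df.clip(upper=upper_population, axis=1)

print(upper_sample)
print(upper_population)
print(clipped)
```

Passing `ddof=0` is only needed if the result has to match an np.std-based version exactly; which definition the formula calls for is not stated in the question.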
Ben.T