2

I have a dataset with some outliers in the AGE field. Here are the unique values of that column, sorted:

unique = df_csv['AGE'].unique()
print(sorted(unique))

[21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 79, 126, 140, 149, 152, 228, 235, 267]

How can I replace any value greater than 80 with the mean or median of my AGE column?

desertnaut
Amira Elsayed Ismail

3 Answers

4

Since you want to modify a column of a dataframe in place, you should use loc with a boolean mask:

 # replace `median` with `mean` if you want
 df_csv.loc[df_csv['AGE']>80,'AGE'] = df_csv['AGE'].median()
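One caveat with the line above: the median is computed over the whole column, outliers included. A minimal sketch of a variant that computes the median from the plausible ages only is shown here. The sample data and the names mask and clean_median are hypothetical, chosen for illustration; the real df_csv comes from the asker's CSV file.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df_csv.
df_csv = pd.DataFrame({"AGE": [21, 35, 52, 60, 79, 126, 267]})

# Boolean mask marking the outliers (ages above 80).
mask = df_csv["AGE"] > 80

# Median of the non-outlier rows only, so 126 and 267 do not influence it.
clean_median = df_csv.loc[~mask, "AGE"].median()

# Overwrite only the outlier rows, in place.
df_csv.loc[mask, "AGE"] = clean_median

print(df_csv["AGE"].tolist())
```

Whether to include the outliers when computing the replacement value is a judgment call; with only a handful of outliers in a large column, the two medians will usually be close.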
Quang Hoang
1

You could do:

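# series is a pandas Series of ages (the output below corresponds to the sorted unique values from the question)
series[series > 80] = series.median()
print(series)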

Output

0     21
1     22
2     23
3     24
4     25
      ..
58    52
59    52
60    52
61    52
62    52
Length: 63, dtype: int64
Dani Mesejo
0
median = df_csv['AGE'].median()
# using apply; it returns a new Series, so assign the result back to the column
df_csv['AGE'] = df_csv['AGE'].apply(lambda x: median if x > 80 else x)

Other method: Here

ombk
  • To explain what apply does: a lambda is a function without a name (similar to def ..., but more concise). In lambda x, x is each value taken from the column. After the colon comes the expression: median if x>80 else x, which returns the median when the value is greater than 80 and otherwise keeps x the same. apply goes over every row and does this check. – ombk Nov 21 '20 at 01:47
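
To make the comment above concrete, here is a small sketch showing that the lambda behaves the same as a named function defined with def. The sample Series and the name cap_age are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample ages; the last two are outliers.
ages = pd.Series([21, 35, 52, 60, 79, 126, 267])
median = ages.median()

def cap_age(x):
    # Same logic as: lambda x: median if x > 80 else x
    return median if x > 80 else x

# apply calls the function once per value and returns a new Series.
with_def = ages.apply(cap_age)
with_lambda = ages.apply(lambda x: median if x > 80 else x)

print(with_def.equals(with_lambda))  # prints True
```

Because apply calls a Python function once per value, the mask-based assignments in the other answers are generally faster on large columns.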