Detecting outliers within one column for ranges of rows

Question

In a given data frame I have these two columns:

 neighbourhood_group
 price

The price column contains the prices for every neighbourhood_group:

    neighbourhood_group price
 0  Brooklyn            149
 1  Manhattan           225
 2  Manhattan           150
 3  Brooklyn            89
 4  Manhattan           80
 5  Manhattan           200
 6  Brooklyn            60
 7  Manhattan           79
 8  Manhattan           79
 9  Manhattan           150

I am trying to detect outliers within each neighbourhood_group.

The only idea I have come up with so far is to group prices by neighbourhood_group, detect outliers within each group, and create a mask for the rows that need to be dropped.

 data.groupby('neighbourhood_group')['price']

I suspect there might be an easier solution for that.

In this case none of your values are outliers, right? – Erfan Aug 18 '19 at 21:16

score 2 · Accepted Answer · answered Aug 18 '19 at 21:20

You can use Groupby.apply to subtract the group mean from each value and keep only the values whose absolute distance from the mean is within 3 * std:

# group_keys=False keeps the original row index, so the mask aligns with df
m = df.groupby('neighbourhood_group', group_keys=False)['price'].apply(lambda x: x.sub(x.mean()).abs() <= (x.std()*3))

df[m]

Output

  neighbourhood_group  price
0            Brooklyn    149
1           Manhattan    225
2           Manhattan    150
3            Brooklyn     89
4           Manhattan     80
5           Manhattan    200
6            Brooklyn     60
7           Manhattan     79
8           Manhattan     79
9           Manhattan    150

note: in this case we get all the rows back, since there are no outliers.
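
For a case where the filter actually removes a row, here is a minimal sketch of the same 3 * std rule written with `transform`, which broadcasts each group's mean and std back onto the original rows. The data is made up for illustration (20 Manhattan prices of 100 plus one of 10000). With the sample standard deviation, a group needs at least 11 rows before any single value can lie more than 3 std from the group mean, which is why the short snippet in the question can never produce an outlier under this rule.

```python
import pandas as pd

# Hypothetical data: 20 ordinary Manhattan prices, one extreme one, and 3 Brooklyn rows
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 21 + ["Brooklyn"] * 3,
    "price": [100] * 20 + [10000] + [149, 89, 60],
})

grouped = df.groupby("neighbourhood_group")["price"]

# transform returns one value per original row, so everything stays index-aligned
distance = (df["price"] - grouped.transform("mean")).abs()
keep = distance <= 3 * grouped.transform("std")

filtered = df[keep]
print(len(df), len(filtered))
```

Only the 10000 row is dropped; the three Brooklyn rows are judged against Brooklyn's own mean and std and are all kept.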

answered Aug 18 '19 at 21:20

Erfan

There's an issue with your solution: you define a value as an outlier only if it's 3 std below the mean. What if it's 3 std above the mean? (Also, the output of m should be true/false.) – adhg Aug 18 '19 at 21:32
@adhg: that is where the `abs()` comes in. – Willem Van Onsem Aug 18 '19 at 21:38
@Willem Van Onsem OK, I might be wrong here, but the mean should always be positive, so the abs doesn't add much, no? Maybe the sub confuses me here. – adhg Aug 18 '19 at 21:49
@adhg: it is not the `mean` that you take `.abs()` of, it is `x.sub(x.mean())` (so the difference between `x` and the mean). – Willem Van Onsem Aug 18 '19 at 21:50
If the value is `2` and the mean is `5`, what does `2-5` give, and what does `abs(2-5)` give? You basically just want the distance of the observation to the mean; negative or positive does not matter, @adhg. – Erfan Aug 18 '19 at 21:50
@Erfan thanks for the clarification. I've got it now; I see what the abs applies to. Roger Roger :-) – adhg Aug 18 '19 at 21:56
Thanks @Erfan. You're right, the snippet of values shown does not contain outliers. Your solution is the most elegant and worked wonders, thanks! – Gara Aug 19 '19 at 11:05

score 1 · Answer 2 · answered Aug 18 '19 at 21:27

I think using groupby makes perfect sense. I would then get the individual groups, using the get_group method for example. Finally you can do any analysis you need; see this example in case you missed it:

Detect and exclude outliers in Pandas data frame

Cheers and good work, I'll follow the question as I'm interested too
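
As a rough illustration of the get_group idea, the sketch below rebuilds the data frame from the question, pulls one group out as its own data frame, and then loops over all groups. The variable names are just placeholders, and the per-group analysis is left as a simple `describe()` call.

```python
import pandas as pd

df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan",
                            "Manhattan", "Brooklyn", "Manhattan", "Manhattan", "Manhattan"],
    "price": [149, 225, 150, 89, 80, 200, 60, 79, 79, 150],
})

grouped = df.groupby("neighbourhood_group")

# Pull out one group as its own DataFrame and analyse it in isolation
brooklyn = grouped.get_group("Brooklyn")
print(brooklyn["price"].describe())

# Or loop over every group; each iteration yields (group name, sub-frame)
group_sizes = {name: len(group) for name, group in grouped}
print(group_sizes)
```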

answered Aug 18 '19 at 21:27

Peruz

Respectfully, you're not providing any solution to the OP. I would strongly suggest that you provide your own work for the problem rather than a link. (I'm quite sure Gara googled it before.) – adhg Aug 18 '19 at 21:34

adhg · Answer 3 · 2019-09-06T13:41:14.007

I'll do it a bit manually:

let's assume your df is this (note I added 2 lines at the bottom)

    neighbourhood_group price
0   Brooklyn    149
1   Manhattan   225
2   Manhattan   150
3   Brooklyn    89
4   Manhattan   80
5   Manhattan   200
6   Brooklyn    60
7   Manhattan   79
8   Manhattan   79
9   Manhattan   150
10  Manhattan   28
11  Manhattan   280

let's add 2 helper columns with the per-group mean and std:

df['mean'] = df.groupby('neighbourhood_group')['price'].transform('mean')
df['std'] = df.groupby('neighbourhood_group')['price'].transform('std')

let's compute a true/false is_outlier flag (here: more than 1 std away from the group mean)

df['is_outlier'] = df.apply(lambda x: x['price']+x['std']<x['mean'] or x['price']-x['std']>x['mean'], axis=1)

result

    neighbourhood_group price   mean              std   is_outlier
0   Brooklyn            149     99.333333   45.390895   True
1   Manhattan           225     141.222222  82.308532   True
2   Manhattan           150     141.222222  82.308532   False
3   Brooklyn            89      99.333333   45.390895   False
4   Manhattan           80      141.222222  82.308532   False
5   Manhattan           200     141.222222  82.308532   False
6   Brooklyn            60      99.333333   45.390895   False
7   Manhattan           79      141.222222  82.308532   False
8   Manhattan           79      141.222222  82.308532   False
9   Manhattan           150     141.222222  82.308532   False
10  Manhattan           28      141.222222  82.308532   True
11  Manhattan           280     141.222222  82.308532   True

Also, as @Willem Van Onsem notes, an 'outlier' is usually defined as a value 3 sigma or more above/below the mean. Keep this in mind for your work; you can choose your own threshold for the deviation from the mean (I used 1 std here).
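
To make the threshold adjustable, the row-wise check can be written as a vectorized helper. This is only a sketch: `flag_outliers` and its `n_std` parameter are names made up for this example, not part of pandas. It uses the 12-row data frame from this answer.

```python
import pandas as pd

def flag_outliers(df, group_col, value_col, n_std=3.0):
    """Return a boolean Series: True where the value lies more than
    n_std group standard deviations away from its group mean."""
    grouped = df.groupby(group_col)[value_col]
    distance = (df[value_col] - grouped.transform("mean")).abs()
    return distance > n_std * grouped.transform("std")

df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn",
                            "Manhattan", "Manhattan", "Brooklyn", "Manhattan",
                            "Manhattan", "Manhattan", "Manhattan", "Manhattan"],
    "price": [149, 225, 150, 89, 80, 200, 60, 79, 79, 150, 28, 280],
})

loose = flag_outliers(df, "neighbourhood_group", "price", n_std=1)
strict = flag_outliers(df, "neighbourhood_group", "price", n_std=3)

print(df[loose].index.tolist())  # [0, 1, 10, 11], same rows as is_outlier above
print(int(strict.sum()))         # 0, nothing is 3 std away in data this small
```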

Minor remark: in statistics, usually something is defined an outlier if there are three std.s or more between the mean and the value. — Willem Van Onsem, Aug 18 '19 at 21:56
@ Willem Van Onsem. 3 sigma - correct- I'll update my answer. — adhg, Aug 18 '19 at 21:59
