4

I'm trying to understand how to identify statistical outliers which I will be sending to a spreadsheet. I will need to group the rows by the index and then find the stdev for specific columns and anything that exceeds the stdev would be used to populate a spreadsheet.

df = pandas.DataFrame({'Sex': ['M','M','M','F','F','F','F'], 'Age': [33,42,19,64,12,30,32], 'Height': ['163','167','184','164','162','158','160'],})

Using a dataset like this I would like to group by sex, and then find entries that exceed either the stdev of age or height. Most examples I've seen are addressing the stdev of the entire dataset as opposed to broken down by columns. There will be additional columns such as state, so I don't need the stdev of every column just particular ones out of the set.

Looking for the ouput to just contain the data for the rows that are identified as statistical outliers in either of the columns. For instance:

0  M  64  164
1  M  19  184

Assuming that 64 years old exceeds the men's stdevs set for height and 184 cm tall exceeds the stdevs for men's height

Thomas
  • 103
  • 1
  • 2
  • 8
  • Okay, here's a question for you. Do you mean wrt MEAN or wrt STD? Because STD isn't the right thing to filter, imo. – cs95 Aug 28 '17 at 22:31
  • It would've been helpful if you'd put out some expected output too. – cs95 Aug 28 '17 at 22:35
  • cmaher -Well, as I'm completely new to pandas and fairly new to python, I was originally pursuing grouping them and then finding the stdev of each column for the group and I was going to use a for statement to iterate through each row to check to see if the stdev is greater than the mean, but from poking around pandas questions more, it looks like something this can handle without that excessive coding. coldspeed -I'm not sure I understand your question fully, but I want any row that the values exceeds 3 stdevs. The stdevs are by group, so there would be different stdevs for m than for f. – Thomas Aug 28 '17 at 22:36
  • The stddev for height is 11. Everything exceeds 11. That's not what you mean I'm guessing. – cs95 Aug 28 '17 at 22:41
  • added sample output format. Correct, I'm wanting anything that exceeds 3 stdevs from the mean. Sorry, I assumed that was implied with the use of stdevs. – Thomas Aug 28 '17 at 22:42

1 Answers1

6

First, convert your height from strings to values.

df['Height'] = df['Height'].astype(float)

You then need to group on Sex using transform to create a boolean indicator marking if any of Age or Height is a statistical outlier within the group.

stds = 1.0  # Number of standard deviation that defines 'outlier'.
z = df[['Sex', 'Age', 'Height']].groupby('Sex').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
>>> outliers
     Age Height
0  False  False
1  False  False
2   True   True
3   True   True
4   True  False
5  False   True
6  False  False

Now filter for rows that contain any outliers:

>>> df[outliers.any(axis=1)]
   Age  Height Sex
2   19     184   M
3   64     164   F
4   12     162   F
5   30     158   F

If you only care about the upside of the distribution (i.e. values > mean + 2 SDs), then just drop the .abs(), i.e. lambda group: (group - group.mean()).div(group.std()).abs() > stds

Alexander
  • 105,104
  • 32
  • 201
  • 196