Run a function by groups

Question

I'm currently working on removing outliers, using the function from Klodian Dhana's article on outliers (https://datascienceplus.com/identify-describe-plot-and-removing-the-outliers-from-the-dataset/#comment-3592066903).

My dataset consists of 95,000 observations divided into 1,050 groups. Is there a way to check for outliers within each group without running the function 1,050 times?

Data(DF)

Group   Height 
 Gr1    2
 Gr1    5
 Gr1    5
 Gr2    75
 Gr2    72
 Gr2    44
 Gr3    4
 Gr3    25
 Gr3    42
 …      …
 Gr1050 43

So I would like to apply the outlier check per group, but keep the result in a single DF.

I'm not an expert, but from my research it seems that the by(), ddply(), and tapply() functions could be used in this case. I also think loops could be useful.

Including a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question will increase your chances of getting an answer. — Samuel, Nov 01 '17 at 00:05
The link you provide uses a simple interquartile range (IQR) filter to flag outliers. You don't provide any sample data, but I would split `DF` by `Group` to get a list of dataframes `ll <- split(DF, DF$Group)`; then use `lapply(ll, function(x) ...)` to flag entries that lie outside of the 1.5 IQR (see `?IQR`) per group. — Maurits Evers, Nov 01 '17 at 00:47

Maurits Evers · Answer 1 · 2017-11-01T01:39:24.033

1

Have a look at the following example. I generate sample data from a Cauchy distribution, whose heavy tails produce values that count as outliers under Tukey's IQR criterion.

# Sample data
set.seed(2017);
df <- cbind.data.frame(
    Group = rep(c("Gr1", "Gr2", "Gr3"), each = 20),
    Height = unlist(lapply(c(10, 20, 30), function(x) rcauchy(20, x))));
head(df);
#  Group    Height
#1   Gr1  9.757403
#2   Gr1  1.476820
#3   Gr1 20.300998
#4   Gr1 11.277766
#5   Gr1  9.118874
#6   Gr1  9.133723

# Split based on Group
ll <- split(df, df$Group);

# Flag entries based on 1.5 IQR
ll <- lapply(ll, function(x) {
    x$outlier <- ifelse(
        x$Height < quantile(x$Height, 0.25) - 1.5 * IQR(x$Height) |
        x$Height > quantile(x$Height, 0.75) + 1.5 * IQR(x$Height),
        TRUE,
        FALSE);
    return(x);
})

# Optionally replace outliers with NA
ll <- lapply(ll, function(x) {
    x$Height[x$outlier] <- NA;
    return(x);
});

# Optionally combine into single dataframe
df.filtered <- do.call(rbind.data.frame, ll);
head(df.filtered);
#      Group    Height outlier
#Gr1.1   Gr1  9.757403   FALSE
#Gr1.2   Gr1        NA    TRUE
#Gr1.3   Gr1        NA    TRUE
#Gr1.4   Gr1 11.277766   FALSE
#Gr1.5   Gr1  9.118874   FALSE
#Gr1.6   Gr1  9.133723   FALSE
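
The split, flag and recombine steps above can also be written as one per-group call that never leaves the original data frame. This is only a minimal sketch, assuming base R's `ave()`; the helper name `is_tukey_outlier` and the `toy` data are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper: TRUE where a value lies outside Tukey's k * IQR fences
is_tukey_outlier <- function(x, k = 1.5) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    fence <- k * (q[2] - q[1])
    !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Made-up data with one obvious outlier per group
toy <- data.frame(
    Group  = rep(c("Gr1", "Gr2"), each = 6),
    Height = c(10, 11, 12, 13, 14, 100, 50, 51, 52, 53, 54, -40))

# ave() applies the helper within each group and keeps the original row order;
# it returns a numeric 0/1 vector here, hence the as.logical()
toy$outlier <- as.logical(ave(toy$Height, toy$Group, FUN = is_tukey_outlier))
toy[toy$outlier, ]
```

Because `ave()` returns a vector aligned with the input rows, no `rbind` step is needed and the row names stay as they were.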

Visualise the distributions before and after the outlier detection analysis.

# Show a comparative plot
library(ggplot2);
df.all <- rbind.data.frame(
    cbind.data.frame(df, src = "pre-outlier analysis"),
    cbind.data.frame(df.filtered[, -3], src = "post-outlier analysis"));
gg <- ggplot(df.all, aes(x = Group, y = Height));
gg <- gg + geom_boxplot() + facet_wrap(~ src, scales = "free_y");
print(gg);

edited Nov 01 '17 at 01:39

answered Nov 01 '17 at 01:29

Maurits Evers


No need for `split` and `lapply` as `by` (the object-oriented wrapper of `tapply`) can handle both! – Parfait Nov 01 '17 at 01:35
True and thanks. I was more illustrating how to implement the IQR outlier criterion yourself without having to do the `boxplot` detour. – Maurits Evers Nov 01 '17 at 01:38
Dear @MauritsEvers thanks a lot for a very helpful answer! – Edgar Esteban Herrera Collazos Nov 01 '17 at 02:51
Dear @MauritsEvers When running the "Flag entries based on 1.5 IQR" the following message pops up: Error in quantile.default(x$Height, 0.25) : missing values and NaN's not allowed if 'na.rm' is FALSE. What can I do here? – Edgar Esteban Herrera Collazos Nov 01 '17 at 03:00
It seems your data contains `NA`'s. You can use `quantile(..., na.rm = TRUE)` and `IQR(..., na.rm = TRUE)`, or remove `NA` entries from your original dataframe, e.g. `DF <- DF[complete.cases(DF), ]`. – Maurits Evers Nov 01 '17 at 03:06
Dear @MauritsEvers using the `na.rm` parameter was really helpful, thanks! – Edgar Esteban Herrera Collazos Nov 01 '17 at 03:37
No worries. Glad to help. PS. Please accept the answer to close the question. – Maurits Evers Nov 01 '17 at 03:54
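
The two suggestions from the comments above (using `by` instead of `split` plus `lapply`, and passing `na.rm = TRUE`) could be combined roughly as follows. This is a sketch under assumptions: the `toy` data, including its missing value, is invented for illustration, and the 1.5 IQR rule is the same one used in the answer.

```r
# Made-up data with one missing value, to exercise na.rm = TRUE
toy <- data.frame(
    Group  = rep(c("Gr1", "Gr2"), each = 6),
    Height = c(10, 11, 12, NA, 14, 100, 50, 51, 52, 53, 54, -40))

# by() splits the data frame by Group and applies the function to each piece
flagged <- by(toy, toy$Group, function(x) {
    q <- quantile(x$Height, c(0.25, 0.75), na.rm = TRUE)
    fence <- 1.5 * IQR(x$Height, na.rm = TRUE)
    # Comparisons with NA give NA, so a missing height gets outlier = NA
    x$outlier <- x$Height < q[1] - fence | x$Height > q[2] + fence
    x
})

# Combine the per-group pieces back into a single data frame
toy_flagged <- do.call(rbind, flagged)
toy_flagged
```

Rows with a missing `Height` end up with `outlier = NA` rather than `FALSE`, which keeps them distinguishable from values that were actually checked.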

score 0 · Answer 2 · answered Nov 01 '17 at 01:40

Here is my approach, which leverages the tidyverse package dplyr:

    #--create dataframe

    Group = c("Gr1", "Gr1", "Gr1","Gr1",
              "Gr1", "Gr1", "Gr1","Gr1", 
              "Gr2", "Gr2", "Gr2","Gr2",
              "Gr2", "Gr2", "Gr2","Gr2",
              "Gr3", "Gr3", "Gr3","Gr3",
              "Gr3", "Gr3", "Gr3","Gr3")

    Height = c(1,21,22,23,
              241,24,29,30,
               2,50,49,50,
               51,50,4900,50,
               10,10,3000,10,
               10,10,2,10) 

    grp_df = data.frame(Group, Height)

    library(dplyr) #--for group_by and summarise functions

    library(outliers) #--for outlier function

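    #--outlier() returns the value farthest from the group mean;
    #--opposite=TRUE returns the extreme value at the other end,
    #--so "lower"/"higher" only hold when the farthest value is the maximum
    new_df <- grp_df %>%
      group_by(Group) %>%
      summarise(lower_outlier = outlier(Height, opposite=TRUE),
                higher_outlier = outlier(Height, opposite=FALSE))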
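
Since `summarise()` returns one row per group, the result above reports candidate extreme values but does not mark individual observations. If the goal is a single data frame with every row flagged, one possible variant is to use `mutate()` with the 1.5 IQR rule instead. This is a sketch only: it swaps in the IQR criterion in place of `outliers::outlier`, and the `toy` data and the column names `q1`, `q3` and `outlier` are assumptions made for illustration.

```r
library(dplyr)

# Made-up data with one obvious outlier per group
toy <- data.frame(
    Group  = rep(c("Gr1", "Gr2"), each = 6),
    Height = c(10, 11, 12, 13, 14, 100, 50, 51, 52, 53, 54, -40))

flagged <- toy %>%
    group_by(Group) %>%
    mutate(
        # Per-group quartiles, recycled to every row of the group
        q1 = quantile(Height, 0.25, na.rm = TRUE, names = FALSE),
        q3 = quantile(Height, 0.75, na.rm = TRUE, names = FALSE),
        # TRUE where the value lies outside the 1.5 * IQR fences
        outlier = Height < q1 - 1.5 * (q3 - q1) |
                  Height > q3 + 1.5 * (q3 - q1)) %>%
    ungroup()

filter(flagged, outlier)
```

Unlike `summarise()`, `mutate()` keeps all rows, so the flagged data can be filtered or set to `NA` afterwards.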
