0

I'm trying to extract outliers from my dataset and tag them accordingly.

Sample Data

     Doctor Name    Hospital Assigned         Region    Claims   Illness Claimed
1    Albert         Some hospital Center      R-1       20       Sepsis
2    Simon          Another hospital Center   R-2       21       Pneumonia
3    Alvin          ...                       ...       ...       ...
4    Robert
5    Benedict
6    Cruz

So I'm trying to group every Doctor that Claimed a certain Illness in a certain Region and trying to find outliers among them.

Doctor Name    Hospital Assigned         Region    Claims   Illness Claimed is_outlier
1    Albert    Some hospital Center      R-1       20       Sepsis         1
2    Simon     Another hospital Center   R-2       21       Pneumonia      0
3    Alvin       ...                       ...       ...       ...
4    Robert
5    Benedict
6    Cruz

I can do this in Power BI. But I can't seem to do it in R. I'm guessing that group_by() function of dplyr is involved. But I'm not sure.

This is what I'm trying to achieve:

sample

Algo goes like:

Read data
Group data by Illness
    Group by Region
    get IQR based on Claims Count
    if claims count > than (Q3 + 1.5) * IQR
        then tag it as outlier = 1
    else
        not an outlier = 0
Export data

I have done this before but this code loops through each Illnesses and applies Linear Regression for each. Is this anywhere near to what I'm trying to achieve?

# Loop through the dataframe and apply model
Ind <- sapply(split(df, list(df$Region,df$Illness_Code)), function(x)nrow(x)>1)

out <- lapply(
        split(df, list(df$Region, df$Illness_Code))[Ind],
         function(c){
          m <- lm(formula = COUNT ~ YEAR, data = c)
          coef(m)
         })

Any ideas?

Hibari
  • 131
  • 16

1 Answers1

1

One possible solution is to use group_by + boxplot_stats. The first will do all combinations of groups and the second will return the outliers values:

library(dplyr)

df <- data.frame(doc = sample(x = letters[1:3], size = 1000, replace = T), 
                 illness = sample(x = LETTERS[1:3], size = 1000, replace = T),
                 claims = rpois(n = 1000, lambda = 10))

df %>%
  group_by(doc, illness) %>%
  mutate(ind_out = if_else(claims %in% boxplot.stats(claims)$out, 1, 0))

# A tibble: 1,000 x 4
# Groups:   doc, illness [9]
   doc   illness claims ind_out
   <fct> <fct>    <int>   <dbl>
 1 c     A            8       0
 2 c     A           13       0
 3 b     C           18       0
 4 b     C            8       0
 5 b     C            8       0
 6 b     B           12       0
 7 a     C           10       0
 8 b     C            9       0
 9 a     B           15       0
10 c     B            8       0
# … with 990 more rows

I hope it works.

Douglas Mesquita
  • 1,011
  • 7
  • 12
  • 1
    Thanks, I'll try this. I have almost 2gb of .csv file. So this might take some time. I'll give you an update. – Hibari Jan 31 '19 at 02:59