Finding "outliers" in a group

Question

I am working with hospital discharge data. All hospitalizations (cases) with the same Patient_ID are supposed to be of the same person. However I figured out that there are Pat_ID's with different ages and both sexes.

Imagine I have a data set like this:

Case_ID <- 1:8
Pat_ID <- c(rep("1",4), rep("2",3),"3")
Sex <- c(rep(1,4), rep(2,2),1,1)
Age <- c(rep(33,3),76,rep(19,2),49,15)
Pat_File <- data.frame(Case_ID, Pat_ID, Sex,Age)

Case_ID Pat_ID Sex Age
1       1      1   33
2       1      1   33
3       1      1   33
4       1      1   76
5       2      2   19
6       2      2   19
7       2      1   49
8       3      1   15

It was relatively easy to identify Pat_ID's with cases that differ from each other. I found these ID's by calculating an average for age and/or sex (coded as 1 and 2) with help of the function aggregate and then calculated the difference between the average and age or sex. I would like to automatically remove/identify cases where age or sex deviate from the majority of the cases of a patient ID. In my example I would like to remove cases 4 and 7.

http://www.r-bloggers.com/finding-outliers-in-numerical-data/ — rmuc8, Mar 14 '15 at 18:29
that link is for identifying statistical outliers, not unusual cases which is the question of the OP — jalapic, Mar 14 '15 at 18:31

score 2 · Answer 1 · answered Mar 14 '15 at 18:07

This method works, I believe, though I doubt it's the quickest or most efficient way.

Essentially I split the dataframe by your grouping variable. Then I found the 'mode' for the variables you're concerned about. Then we filtered those observations that didn't contain all of the modes. We then stuck everything back together:

library(dplyr) # I used dplyr to 'filter' though you could do it another way
temp <- split(Pat_File, Pat_ID)

Mode.Sex <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Sex)); names(temp1)[temp1 == max(temp1)]})

Mode.Age <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Age)); names(temp1)[temp1 == max(temp1)]})

temp.f<-NULL
for(i in 1:length(temp)){
  temp.f[[i]] <- temp[[i]] %>% filter(Sex==Mode.Sex[[i]] & Age==Mode.Age[[i]])
}

do.call("rbind", temp.f)

#  Case_ID Pat_ID Sex Age
#1       1      1   1  33
#2       2      1   1  33
#3       3      1   1  33
#4       5      2   2  19
#5       6      2   2  19
#6       8      3   1  15

score 2 · Accepted Answer · edited May 23 '17 at 11:50

2

You could try

library(data.table)

Using Mode from Is there a built-in function for finding the mode?

  Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
 }

setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
#    Pat_ID Case_ID Sex Age
#1:      1       1   1  33
#2:      1       2   1  33
#3:      1       3   1  33
#4:      2       5   2  19
#5:      2       6   2  19
#6:      3       8   1  15

Testing other cases,

 Pat_File$Sex[6] <- 1
 Pat_File$Age[4] <- 16
 setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
 #    Pat_ID Case_ID Sex Age
 #1:      1       1   1  33
 #2:      1       2   1  33
 #3:      1       3   1  33
 #4:      2       6   1  19
 #5:      3       8   1  15

edited May 23 '17 at 11:50

Community

1
1

answered Mar 14 '15 at 18:19

akrun

874,273
37
540
662

If for instance the Age outlier for Pat_ID is 16 rather than 76, then this doesn't work. – jalapic Mar 14 '15 at 18:23
1

Using `mode` instead of `mean` this now works for the case of checking ages. - Neat function. – jalapic Mar 14 '15 at 18:34
@jalapic I think it should work now. Tried by changing `Pat_File$Sex[6] <- 1;Pat_File$Age[4] <- 16`; Getting the same result as yours – akrun Mar 14 '15 at 18:41
I don't get this function, it all could be done just with `as.numeric(names(sort(-table(x)))[1])` – David Arenburg Mar 14 '15 at 18:53
@DavidArenburg Yes, that also works, I think `tabulate` might be a bit faster, not tested – akrun Mar 14 '15 at 18:58
Maybe, because `table` essentially calls `tabulate` but usually `table` is highly efficient. – David Arenburg Mar 14 '15 at 19:00
This one works best for my 2.7 *10^6 row data set. Thanks to all of you! – Roccer Mar 14 '15 at 19:18

ishelton · Answer 3 · 2015-03-16T02:40:43.600

Here is another approach using the sqldf package: 1) Create new dataframe (called data_groups) with unique groups based on Pat_ID, Sex, and Age 2) For each unique group, check Pat_ID against every other group and if the Pat_ID of one group matches another group, select the group with lower count and store in new vector (low_counts) 3) Take new datafame (data_groups) and take out Pat_IDs from new vector (low_counts) 4) Recombine with Pat_File

Here is the code:

library(sqldf)

# Create new dataframe with unique groups based on Pat_ID, Sex, and Age
data_groups <- sqldf("SELECT *, COUNT(*) FROM Pat_File GROUP BY Pat_ID, Sex, Age")
# Create New Vector to Store Pat_IDs with Sex and Age that differ from mode
low_counts <- vector()
# Unique groups
data_groups

for(i in 1:length(data_groups[,1])){
  for(j in 1:length(data_groups[,1])){
    if(i<j){
      k <- length(low_counts)+1
      result <- data_groups[i,2]==data_groups[j,2]
      if(is.na(result)){result <- FALSE}
      if(result==TRUE){
        if(data_groups[i,5]<data_groups[j,5]){low_counts[k] <- data_groups[i,1]}      
        else{low_counts[k] <- data_groups[j,1]}
      }
    }
  }
}

low_counts <- as.data.frame(low_counts)
# Take out lower counts
data_groups <- sqldf("SELECT * FROM data_groups WHERE Case_ID NOT IN (SELECT * FROM low_counts)")

Pat_File <- sqldf("SELECT Pat_File.Case_ID, Pat_File.Pat_ID, Pat_File.Sex, Pat_File.Age FROM data_groups, Pat_File WHERE data_groups.Pat_ID=Pat_File.Pat_ID AND data_groups.Sex=Pat_File.Sex AND data_groups.Age=Pat_File.Age ORDER BY Pat_File.Case_ID")

Pat_File

Which Provides the following results:

  Case_ID Pat_ID Sex Age
1       1      1   1  33
2       2      1   1  33
3       3      1   1  33
4       5      2   2  19
5       6      2   2  19
6       8      3   1  15

Finding "outliers" in a group

3 Answers3