
I have a regularly updated biological database currently containing records of 13,500 individuals across 250 species (species = factor levels in the 1st column). Each individual has a unique ID (2nd column). For each individual, 7 different measurements were recorded (columns 3-9). With that many values entered into a database by hand, I'm sure some typos occurred and created outliers, e.g., a measurement that should read 15.2 could have been entered as 1.52 or 152 or 25.2. I'd like to identify those outliers so I can fix them rather than trashing them, but there are too many species to do it on a case-by-case basis. How can I automate an outlier search for every measurement, subset by species, and organize the output? This last part is critical, since each species could be a different size with drastically different measurements. I'm trying to streamline as much as possible, since this is something that will likely be done every time a new batch of data is added to the database (or until someone springs for a FileMaker license).

I'm analyzing in R. I think a nested for loop flagging all values outside 2 or 3 standard deviations of the mean would do the trick, and/or group_by with dplyr and the quantile function. But I haven't been able to figure out how to run all columns at once while returning the actual outlier values. There are a number of other questions addressing pieces of this, but I can't find one that puts it all together.

Example data:

df = data.frame(
  species = c("a","b","a","b","a","b","a","b","a","b"),
  uniqueID = c("x01","x02","x03","x04","x05","x06","x07","x08","x09","x10"),
  metric1 = c(1,2,3,1,2,3,1,2,3,11),
  metric2 = c(4,5,6,4,5,6,55,4,5,6),
  metric3 = c(0.7,7,8,9,7,8,9,77,8,9)
)

As far as expected results go, I'm envisioning a data.frame or matrix reporting the species, uniqueID, measurement/column containing the outlier, and the outlier value itself. But how that's formatted is less important, e.g.:

outliers = data.frame(
  species = c("a","a","b","b"),
  uniqueID = c("x01","x07","x08","x10"),
  var = c("metric3","metric2","metric3","metric1"),
  value = c(0.7,55,77,11)
)

Thanks in advance!

quepur
  • Do the `metric` variables have a known range of possible values? That is, do you identify 0.7 and 77 in `metric3` as an outlier because they are impossible values (e.g. `metric3` can only range from 1-10) or rather because they're unrealistic values given the typical values listed (which are otherwise only 7s, 8s, or 9s)? – coip May 13 '19 at 22:03
  • @coip, Yes and no. Ranges exist for each metric for each species, but I don't know them without calculating them, and they vary by species. To give you some more details, these are measurements of bird morphology. So metric 1 (bill length) for species x could lie between 10-15mm and between 80-100mm for species y. Values outside those limits should not be possible except by human error during data entry. I just made up the values above figuring it would be easier. – quepur May 13 '19 at 22:22
  • @quepur have you had a chance to consider my answer? Is there something that could make it better? – Benjamin May 15 '19 at 00:12
  • 1
    @Benjamin, that was great, thank you. There is one minor problem that you could maybe edit in case other people find this useful. By replicating the data to keep the SD calculation alive, you replicated the unique IDs, which doesn't fly. I altered your code a bit after tripling the rows per your suggestion by setting df2$uniqueID to equal a vector of unique IDs from x01-x31 and then removing uniqueID from group_by. This results in the desired output! Thanks again for your help. – quepur May 16 '19 at 17:14
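A minimal sketch of the ID fix described in this comment, using a tiny made-up frame in place of the tripled df2 (reassigning IDs after replication so no ID repeats):

```r
# Sketch of the comment's fix: after replicating rows so sd() has enough
# observations, overwrite uniqueID so every row gets its own label.
df  <- data.frame(species = c("a", "b"), uniqueID = c("x01", "x02"))
df2 <- rbind(df, df, df)  # tripled: 6 rows, but the IDs are now duplicated

# Reassign sequential IDs (x01, x02, ..., x06) so no ID repeats
df2$uniqueID <- sprintf("x%02d", seq_len(nrow(df2)))
```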

1 Answer


To start with the data you provided...

df = data.frame(
  species = c("a","b","a","b","a","b","a","b","a","b"),
  uniqueID = c("x01","x02","x03","x04","x05","x06","x07","x08","x09","x10"),
  metric1 = c(1,2,3,1,2,3,1,2,3,11),
  metric2 = c(4,5,6,4,5,6,55,4,5,6),
  metric3 = c(0.7,7,8,9,7,8,9,77,8,9)
)

I'm going to use the tidyverse liberally here...

library(tidyverse)

Then to triple the rows so the standard deviation calculation doesn't die on us, and to add another outlier row...

df2 <- df %>% 
  bind_rows(df) %>% 
  bind_rows(df) %>% 
  add_row(
    species = "a",
    uniqueID = "x01",
    metric1 = 1,
    metric2 = 4,
    metric3 = 1e12
  )

What if you tried something like this?

df2 %>% 
  gather(key = "metric", value = "value", -species, -uniqueID) %>% 
  group_by(species, uniqueID, metric) %>% 
  arrange(species, uniqueID, metric) %>% # just to make the results easy to scan
  mutate(
    mean_obs = mapply(function(x) mean(value[-x]), 1:n()),
    stdev    = mapply(function(x)   sd(value[-x]), 1:n()),
    minimum  = mean_obs - stdev * 2,
    maximum  = mean_obs + stdev * 2,
    outlier  = value < minimum | value > maximum
  ) %>% 
  filter(outlier) %>% 
  glimpse()

It borrows from this answer to find the mean and standard deviation excluding the current record, and then marks a row as an outlier if the row is more than 2 SD from the mean.

Leaving out the current record can get weird when the record is not an outlier but still appreciably shifts the mean and standard deviation. When the record is an outlier, though, excluding it from its own baseline is exactly what you want. :)
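As a follow-up sketch incorporating the fix worked out in the comments (this is not the answer as originally posted): grouping by species and metric only, so that all individuals of a species are pooled and uniqueID merely labels each row, the same leave-one-out logic recovers exactly the outliers data.frame the question asked for, even on the original ten-row df. The abs() comparison here is equivalent to the minimum/maximum bounds above.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  species  = c("a","b","a","b","a","b","a","b","a","b"),
  uniqueID = c("x01","x02","x03","x04","x05","x06","x07","x08","x09","x10"),
  metric1  = c(1,2,3,1,2,3,1,2,3,11),
  metric2  = c(4,5,6,4,5,6,55,4,5,6),
  metric3  = c(0.7,7,8,9,7,8,9,77,8,9)
)

outliers <- df %>%
  gather(key = "metric", value = "value", -species, -uniqueID) %>%
  group_by(species, metric) %>%  # pool all individuals of a species
  mutate(
    # mean and SD computed excluding the current record, as in the answer
    mean_obs = mapply(function(i) mean(value[-i]), seq_len(n())),
    stdev    = mapply(function(i)   sd(value[-i]), seq_len(n())),
    outlier  = abs(value - mean_obs) > 2 * stdev
  ) %>%
  ungroup() %>%
  filter(outlier) %>%
  select(species, uniqueID, metric, value)

# Flags the four outliers from the question's expected output:
# (a, x01, metric3, 0.7), (a, x07, metric2, 55),
# (b, x08, metric3, 77), (b, x10, metric1, 11)
```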

Benjamin
  • I think the key insight here is that you can `gather()` first, then group by both species / uniqueID _and_ metric, then use whatever function or functions you like to set the range, using `mutate()` instead of `summarise()` so as not to lose the raw observations. – Benjamin May 13 '19 at 22:34