I have a regularly updated biological database currently containing records of 13,500 individuals across 250 species (species = factor levels in 1st column). Each individual has a unique ID (2nd column). For each individual, 7 different measurements were recorded (columns 3-9). With that many values entered into a database by hand, I'm sure some typos occurred and created outliers, e.g., a measurement that should read 15.2 could have been entered as 1.52 or 152 or 25.2. I'd like to identify those outliers so I can fix them rather than trashing them, but there are too many species to do it on a case by case basis. How can I automate and organize output of an outlier search for every measurement subset by every species? This last part is critical since each species could be a different size with drastically different measurements. I'm trying to streamline as much as possible since this is something that will likely be done every time a new batch of data is added to the database (or until someone springs for a filemaker license).
I'm analyzing in R. I think a nested for loop for all values outside 2 or 3 standard deviations of the mean would do the trick, and/or group_by with dplyr and the quantile function. But I haven't been able to figure out how to run all columns at once while returning the actual outlier values. There are a number of other questions addressing pieces of this, but I can't find any that puts it all together.
Example data:
df = data.frame(
species = c("a","b","a","b","a","b","a","b","a","b"),
uniqueID = c("x01","x02","x03","x04","x05","x06","x07","x08","x09","x10"),
metric1 = c(1,2,3,1,2,3,1,2,3,11),
metric2 = c(4,5,6,4,5,6,55,4,5,6),
metric3 = c(0.7,7,8,9,7,8,9,77,8,9)
)
As far as expected results go, I'm envisioning a data.frame or matrix reporting species, unique_ID, measurement/column with outlier, and the outlier value itself. But how that's formatted is less important, e.g.:
outliers = data.frame(
species = c("a","a","b","b"),
uniqueID = c("x01","x07","x08","x10"),
var = c("metric3","metric2","metric3","metric1"),
value = c(0.7,55,77,11)
)
Thanks in advance!