
I have a data set that needs to be cleaned of errors. For that, I have a subset that contains only observations that I know are correct ("Match"). I would like to draw a 95% confidence ellipse around those correct observations on a plot and exclude from my main data set all observations that fall outside the ellipse. I figured out how to draw the ellipse, but now I would like to be able to remove data based on it.

I'm a beginner with R, so all of this is pretty new to me and I might not understand complicated code. :)

Thanks !


To add more details: my data are measurements of collembola (a type of insect). They have this basic structure:

  replicate node day MajorAxisLengtnh MinorAxisLength Data.type
1         1    1  50              2.1             0.4     Match
2         2    1  50              2.3             0.2   Unknown

I want to validate the measurements by excluding unrealistic aspect ratios (length/width). Using the subset that I know is correct (the "Match" observations), I want to determine a reasonable range of aspect ratios for collembola and use it to remove any unrealistic observations. I was advised to use a 95% confidence ellipse for the good observations and remove observations that fall outside the ellipse.

Ian Campbell
lorena
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. What modeling assumptions do you want to make to calculate a confidence interval? – MrFlick Jul 14 '20 at 22:15

1 Answer


The SIBER package has some functions to help you here.

library(SIBER)

Let's use the iris dataset, plotting sepal width vs length.

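dat <- iris[, 1:2]  # Sepal.Length and Sepal.Width
plot(dat)

mu <- colMeans(dat)  # centre of the ellipse
Sigma <- cov(dat)    # covariance matrix that sets its shape and orientation

# draw the 95% ellipse on the existing plot
addEllipse(mu, Sigma, p.interval = 0.95, col = "blue", lty = 3)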

Z <- pointsToEllipsoid(dat, Sigma, mu)  # converts the data to ellipsoid coordinates
out <- !ellipseInOut(Z, p = 0.95)  # logical vector
(outliers <- dat[out,])  # finds the points outside the ellipse

#    Sepal.Length Sepal.Width
#16           5.7         4.4
#34           5.5         4.2
#42           4.5         2.3
#61           5.0         2.0
#118          7.7         3.8
#132          7.9         3.8

points(outliers, col="red", pch=19)


You can then use the out vector to remove unwanted rows.

dat.in <- dat[!out,]
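
For the situation in the question, the ellipse would be estimated from the "Match" rows only and then applied to every row. Here is a minimal sketch of that idea in base R, without SIBER. It rests on some assumptions: the data below are simulated stand-ins (the column names follow the question, the values are made up), the trusted points are treated as roughly bivariate normal, and the 95% region is taken as the set of points whose squared Mahalanobis distance from the "Match" centre is at most qchisq(0.95, df = 2). I have not checked that this flags exactly the same rows as the SIBER functions above.

```r
# Simulated stand-in for the asker's data (values are invented for illustration)
set.seed(1)
n <- 200
major <- rnorm(n, mean = 2.2, sd = 0.3)
minor <- 0.2 * major + rnorm(n, sd = 0.03)
dat <- data.frame(
  MajorAxisLength = major,
  MinorAxisLength = minor,
  Data.type = rep(c("Match", "Unknown"), times = c(50, 150))
)
# Make a few "Unknown" rows deliberately implausible (far too wide)
dat$MinorAxisLength[51:55] <- dat$MinorAxisLength[51:55] * 4

xy <- c("MajorAxisLength", "MinorAxisLength")

# Estimate the ellipse from the trusted observations only
ref <- dat[dat$Data.type == "Match", xy]
mu <- colMeans(ref)
Sigma <- cov(ref)

# Squared Mahalanobis distance of every row from the "Match" centre
d2 <- mahalanobis(dat[, xy], center = mu, cov = Sigma)

# Squared-distance cutoff for a 95% region under bivariate normality
cutoff <- qchisq(0.95, df = length(xy))
inside <- d2 <= cutoff

# Keep rows inside the ellipse; trusted rows are kept regardless
dat.clean <- dat[inside | dat$Data.type == "Match", ]
```

Keeping the "Match" rows even when they fall outside the ellipse is a design choice: about 5% of correct observations are expected to land outside a 95% region, and those rows are already known to be good.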
Edward