I collected data (5 variables) for 1000 items.

# example data (my data is not necessarily multivariate normal!)
data <- rbind(c(7.2, 9, 14.1, 22.3, 3.9),
              cbind(x1 = rnorm(999, 10, 1),
                    x2 = rnorm(999, 8, 0.2),
                    x3 = rnorm(999, 12.4, 1.2),
                    x4 = rnorm(999, 17.8, 1.09),
                    x5 = rnorm(999, 8.9, 2.1)))

Since I suspect the first item (1st row, data[1,]) to belong to the same multivariate distribution, I'd like to calculate the probability that this specific item has been drawn from the empirical distribution (estimated by the remaining 999 x 5 item values, data[-1,]).

How can I estimate this probability using R? Every idea is appreciated!

Thanks a lot in advance for your help!

Anti
  • I do have some ideas about this topic, but I think you need to be clearer on what you are asking. What would it mean if someone gave you a direct answer, say that there is a 30% chance the 1st point came from the distribution implied by the rest of the points? Where could it have come from the other 70% of the time? – pseudospin Jan 11 '21 at 18:34
  • @pseudospin I'm trying to identify "outliers". I have a big data set (25 measurements from ~45000 individuals) and have seen that there are some odd, physically impossible values. I'm sure they were caused by transcription errors (typos, "forgotten" digits, etc.). Thus, I'd like to somehow identify these suspect numbers and compare them with the raw data (from hand-written manuscripts and different publications) ... – Anti Jan 11 '21 at 19:07
  • Outlier detection is a huge area. There's a bunch of ideas out there, e.g. [here](https://stats.stackexchange.com/questions/213/what-is-the-best-way-to-identify-outliers-in-multivariate-data) – pseudospin Jan 13 '21 at 15:13
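
One possible starting point for the outlier screening discussed in the comments is the Mahalanobis distance of the suspect row from the remaining rows. The sketch below is not an answer from the thread, just an illustration under stated assumptions: the names `ref`, `item`, `p_emp` and `p_chisq` are made up here, and the seed is only there to make the example reproducible. Note that neither number is "the probability that the item was drawn from the distribution". `p_emp` is the share of reference rows lying at least as far from the centre as the item, and `p_chisq` is a tail probability that is only meaningful if the data are roughly multivariate normal.

```r
set.seed(1)  # assumption: fixed seed, only for a reproducible example

# Reference rows (same simulated data as in the question) and the suspect item
ref <- cbind(x1 = rnorm(999, 10, 1),
             x2 = rnorm(999, 8, 0.2),
             x3 = rnorm(999, 12.4, 1.2),
             x4 = rnorm(999, 17.8, 1.09),
             x5 = rnorm(999, 8.9, 2.1))
item <- c(7.2, 9, 14.1, 22.3, 3.9)

# Centre and covariance estimated from the reference rows only
ctr <- colMeans(ref)
S   <- cov(ref)

# Squared Mahalanobis distances (stats::mahalanobis returns squared values)
d2_ref  <- mahalanobis(ref, ctr, S)
d2_item <- mahalanobis(item, ctr, S)

# Empirical tail proportion: share of reference rows at least as extreme.
# The +1 terms keep the estimate away from exactly zero.
p_emp <- (sum(d2_ref >= d2_item) + 1) / (length(d2_ref) + 1)

# Chi-squared tail probability; assumes approximate multivariate normality
p_chisq <- pchisq(d2_item, df = ncol(ref), lower.tail = FALSE)

print(c(d2_item = d2_item, p_emp = p_emp, p_chisq = p_chisq))
```

A small `p_emp` only says the row is unusual relative to the others; it does not say why. With real data that already contain transcription errors, the sample mean and covariance are themselves pulled by those errors, so a robust estimate (for example `MASS::cov.rob`) may be a better basis for the distances.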
