In R, how do you generate a vector (data) with outliers? Ideally, the data would be "acceptably" normally distributed.
1
-
You can combine the various RNGs in R like `runif`, `rnorm`, `rgamma` to get a mixture model that is "acceptably" normal with some added noise. That said, your question is too broad for this forum. Please be more specific. – Ferdinand.kraft Sep 18 '13 at 20:28
-
In my opinion it is a worthwhile question to ask. I did not ask for a syntax example. Just a hint. Thus, your degree of detail is appropriate and a sound answer, too. Thank you. – feder Sep 18 '13 at 20:38
-
@feder your question could also be closed under the *off topic: Questions asking for code must demonstrate a minimal understanding of the problem being solved*, as well as the too broad category. Please see [**how to make a great reproducible example**](http://stackoverflow.com/q/5963269/1478381) for more tips on asking a well formed question. – Simon O'Hanlon Sep 18 '13 at 20:44
-
Too broad. In all honesty you could argue `rnorm(100)` will produce outliers by definition. – Señor O Sep 18 '13 at 22:16
-
I concur. Of course a distribution creates outliers; otherwise it would not be a distribution (having a default value 1) and thus it would simply be multiple observations of the very same occurrence. I'm new to this R tag and to the R software altogether. Hence, I simply assumed that people answering questions would IMPLY that I'm looking for an answer as Ferdinand, DWin and gung have recommended, i.e. a graph with a small kurtosis or skewness. There should be nothing wrong with general questions, if not asking for more than a general answer. But that is my humble opinion, valid for every context. – feder Sep 19 '13 at 06:20
2 Answers
3
@DWin is right that this depends on what you mean by "outlier". For the record, I use the same definition that he is using, so I would use (and have used) something like the code he and @Ferdinand.kraft list. Others sometimes mean a datum more extreme than you might typically find. This is tricky to define for a simulation study, but a common definition is a point more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile. Here is a simple way to generate a sample that contains such a point (I'm sure there are more efficient ways):
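N <- 100  # sample size (assumed here; N was left undefined)
flag <- 0
while (flag == 0) {
  X <- rnorm(N)
  bp <- boxplot(X, plot = FALSE)  # bp$out holds the points the boxplot rule flags
  if (length(bp$out) != 0) {
    flag <- 1  # stop once the sample contains at least one outlier
  }
}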
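
The same redraw-until-flagged idea can be wrapped in a function with a cap on the number of attempts, so that it cannot loop forever. This is only a sketch: the name `sample_with_outlier` and the `max_tries` argument are made up for illustration, and `boxplot.stats()` is used in place of `boxplot(plot = FALSE)` because it returns the flagged points directly.

```r
# Hypothetical helper (not part of the answer above): redraw a normal sample
# until the boxplot rule flags at least one point, or give up after max_tries.
sample_with_outlier <- function(n, max_tries = 1000) {
  for (i in seq_len(max_tries)) {
    x <- rnorm(n)
    out <- boxplot.stats(x)$out  # points beyond 1.5 * IQR from the hinges
    if (length(out) > 0) {
      return(list(x = x, outliers = out, tries = i))
    }
  }
  stop("no sample with an outlier found in ", max_tries, " tries")
}

set.seed(1)  # arbitrary seed, only so the run is reproducible
res <- sample_with_outlier(100)
res$outliers  # the flagged values
res$tries     # how many samples had to be drawn
```

Note that conditioning on "has at least one flagged point" changes the distribution of the returned sample slightly, which may or may not matter for a simulation study.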

gung - Reinstate Monica
1
This really depends on the definition of "outlier":
c(rnorm(100), 100, -100) # an egregious example
plot(density( c( rnorm(90), rnorm(5, 1) ) ) ) # not as egregious
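
A middle ground between the two lines above is a contaminated normal: most values come from a standard normal and a small fraction from a much wider one, which is one way to read the "combine the RNGs into a mixture" suggestion from the comments. The sketch below is only an illustration; the sample size, the 5% contamination rate and the standard deviation of 10 are assumptions, not values from this thread.

```r
set.seed(42)   # arbitrary seed, only so the run is reproducible
n     <- 200   # total sample size (assumed)
p_out <- 0.05  # assumed fraction of contaminated observations

# TRUE marks an observation drawn from the wide component
is_out <- runif(n) < p_out

x <- ifelse(is_out,
            rnorm(n, mean = 0, sd = 10),  # wide component supplies the outliers
            rnorm(n, mean = 0, sd = 1))   # bulk of the data

flagged <- boxplot.stats(x)$out  # values the 1.5 * IQR boxplot rule would flag
plot(density(x))                 # roughly normal centre with heavier tails
```

Keeping `is_out` around makes it possible to compare which points were actually contaminated with which ones a detection rule flags.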

IRTFM