
I'm looking to do some basic simulation in R to examine the nature of p-values. My goal is to see whether large sample sizes trend towards small p-values. My thought is to generate two random vectors of 1,000,000 data points each, regress one on the other, and then plot the distribution of the p-values and look for skew.

This is what I was thinking so far:

x1 <- runif(1000000, 0, 1000)  # predictor: 1e6 uniform draws
x2 <- runif(1000000, 0, 1000)  # response, drawn independently
model1 <- lm(x2 ~ x1)

Using code taken from another thread:

lmp <- function(modelobject) {
    # inherits() is safer than class(x) != "lm", since class() can return a vector
    if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm'")
    # overall F statistic and its degrees of freedom
    f <- summary(modelobject)$fstatistic
    # p-value for the model's F-test
    p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
    attributes(p) <- NULL
    p
}
lmp(model1)
[1] 0.3874139

Any suggestions for how I might do this for 1000 models or even more? Thanks!

macworthy
  • These posts might be useful: http://stackoverflow.com/q/29803993/1989480 and http://stackoverflow.com/questions/36571864/why-the-built-in-lm-function-is-so-slow-in-r – chinsoon12 Apr 24 '16 at 07:30

1 Answer


See ?replicate ... but note that the p-value you're calculating assumes Gaussian errors, not uniform ones (not that this will matter much at n = 10^6).

Specifically, something like this:

nrep <- 1000     # number of simulated models
ndat <- 1000000  # observations per model
results <- replicate(nrep, {
    x1 <- runif(ndat, 0, 1000)
    x2 <- runif(ndat, 0, 1000)
    model1 <- lm(x2 ~ x1)  # same formula as in the question
    lmp(model1)
})

should work, but it will take a long time to run. Once it finishes, the distribution check from your question is a one-liner.
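A minimal plotting sketch (if x1 and x2 are truly independent, the p-values should come out roughly uniform on [0, 1] rather than skewed):

hist(results, breaks = 20, xlab = "p-value",
     main = "Distribution of simulated p-values")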

I'd suggest making nrep and ndat smaller to try it out.
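If even the trial runs feel slow, one possible shortcut, in the spirit of the speed threads linked in the comments (my suggestion, not something the approach above depends on): with a single predictor, the model's F-test p-value coincides with the p-value from a Pearson correlation test, so cor.test() skips the lm()/summary() overhead entirely:

nrep <- 1000
ndat <- 10000  # smaller n for a quick trial
results_fast <- replicate(nrep, {
    x1 <- runif(ndat, 0, 1000)
    x2 <- runif(ndat, 0, 1000)
    # for simple regression this equals lmp(lm(x2 ~ x1))
    cor.test(x1, x2)$p.value
})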

Glen_b