It's a bit unclear what you're looking for (this should probably be on Cross Validated), but here's a start and an approximate description of linear regression.
Let's say I have some data points that are 3-dimensional (Noise, PC1, PC2), and you say there are 45 of them.
x=data.frame(matrix(rnorm(3*45),ncol=3)) # 45 rows of 3 standard-normal variables
names(x)<-c('Noise','PC1','PC2')
These data are randomly distributed around this 3-dimensional space. Now we imagine there's another variable that we're particularly interested in, called Trait. We think that the variation in each of Noise, PC1, and PC2 can explain some of the variation observed in Trait. In particular, we think each of those variables is linearly related to Trait, so it's just the basic old y=mx+b linear relationship you've seen before, but with a different slope m for each of the variables. So in total we imagine Trait = m1*Noise + m2*PC1 + m3*PC2 + b, plus some added noise (it's a shame one of your variables is named Noise; that's confusing).
So going back to simulating some data, we'll just pick some values for these slopes and put them in a vector called beta.
beta<-c(-3,3,.1) # the regression coefficients m1, m2, m3
So the model Trait = m1*Noise + m2*PC1 + m3*PC2 + b can also be expressed with simple matrix multiplication, and we can do it in R with
trait<- as.matrix(x)%*%beta + rnorm(nrow(x),0,1)
where we've added Gaussian noise of standard deviation equal to 1, and implicitly set the intercept b to 0, since it's dropped from the matrix product.
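Just to connect the matrix form back to the written-out model, here's a quick sketch (the name systematic is mine, not from the question) checking that the matrix product matches the term-by-term sum:
# the noise-free part of the model, written out term by term
systematic <- beta[1]*x$Noise + beta[2]*x$PC1 + beta[3]*x$PC2
all.equal(as.vector(as.matrix(x)%*%beta), systematic) # should be TRUE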
So this is the 'simulated data' underlying a linear regression model. Just as a sanity check, let's try
l<-lm(trait~Noise+PC1+PC2,data=x)
summary(l)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.13876    0.11159   1.243    0.221    
Noise       -3.08264    0.12441 -24.779   <2e-16 ***
PC1          2.94918    0.11746  25.108   <2e-16 ***
PC2         -0.01098    0.10005  -0.110    0.913    
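As one more sanity check (a sketch beyond what the question asked for), you can compare the fitted slopes directly to the beta we simulated with; the true values should usually fall inside the confidence intervals:
cbind(true=beta, estimated=coef(l)[-1]) # drop the intercept to line up with beta
confint(l) # 95% confidence intervals for each coefficient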
So notice that the slope we picked for PC2 was so small (0.1) relative to the overall variability in the data that it isn't detected as a statistically significant predictor, while the other two variables have opposite effects on Trait. So in simulating data, you might adjust the observed ranges of the variables, as well as the magnitudes of the regression coefficients in beta.
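For instance, here's a rough sketch of that kind of adjustment (beta2 and trait2 are just names I've made up): increasing PC2's coefficient and shrinking the residual noise makes PC2 detectable:
beta2<-c(-3,3,1) # bump the PC2 slope from 0.1 up to 1
trait2<- as.matrix(x)%*%beta2 + rnorm(nrow(x),0,0.5) # and halve the noise sd
summary(lm(trait2~Noise+PC1+PC2,data=x)) # PC2 should now come out significant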