
I have an experimental data set which I'd like to fit to a polynomial. The data comprise an independent variable, the dependent variable and an uncertainty in the measurement of the latter, for example

2000  0.2084272   0.002067834
2500  0.207078125 0.001037248
3000  0.2054202   0.001959138
3500  0.203488075 0.000328942
4000  0.2013152   0.000646088
4500  0.198933825 0.001375657
5000  0.196375    0.000908696
5500  0.193668575 0.00014721
6000  0.1908432   0.000526976
6500  0.187926325 0.001217318
7000  0.1849442   0.000556495
7500  0.181921875 0.000401883
8000  0.1788832   0.001446992
8500  0.175850825 0.001235017
9000  0.1728462   0.001676249
9500  0.169889575 0.001011735
10000 0.167       0.000326678

(columns: x, y, ±y).

I can carry out a polynomial fit using the above with for example

mydata = read.table("example.txt")
model <- lm(V2 ~ V1 + I(V1^2) + I(V1^3) + I(V1^4), data = mydata)

but this does not make use of the uncertainty values. How do I inform R that the third column of the data set is an uncertainty and that it should therefore be used in the regression analysis?

Joseph Wright
  • Note that the data here are 'simulated', based on but not using the real thing. – Joseph Wright Dec 22 '15 at 14:57
  • How would you like the uncertainty to be used? As another independent variable? As something else? And please do yourself a favor, and don't get into the habit of attaching data. It clutters your environment and can lead to issues with grouped data and sorting for example. – Heroka Dec 22 '15 at 15:01
  • @Heroka It's quite standard in graphical analysis programs (Origin, Igor, ...) to have a column used in analysis as the uncertainty: I'm no statistician so I don't know how it's used beyond that. On the 'attaching' I guess you mean `attach(data)`: I've got that from _The R Book_ (2nd ed., _e.g._ page 467), so assume(d) it's standard. – Joseph Wright Dec 22 '15 at 15:10
  • The uncertainty thing is outside my knowledge, sorry. On attaching: it is not standard, but very often taught (especially in beginner courses). In my humble opinion, it's not a good habit to have because over time your analyses will become more complex (grouped operations, sorting, exporting data and multiple datasets are all easier with everything contained in one object; it will prevent you from making hard-to-debug mistakes). [Here](http://stackoverflow.com/questions/10067680/why-is-it-not-advisable-to-use-attach-in-r-and-what-should-i-use-instead) is more discussion on the subject. – Heroka Dec 22 '15 at 15:15

1 Answer


With measurement error in the dependent variable that is uncorrelated with the independent variables, the estimated coefficients are unbiased but the standard errors are too small. Here's the reference I used (pages 1 and 2): http://people.stfx.ca/tleo/econ370term2lec4.pdf

I think you just need to adjust the standard errors computed by lm(), and that's what I've tried to do in the code below. I'm no statistician, though, so you might want to post to Cross Validated and ask for better intuition.

For the example below, I assumed the "uncertainty" column is the standard deviation (or standard error) of each measurement. For simplicity, I changed the model to just y ~ x.

# initialize dataset
df <- data.frame(
        x = seq(2000, 10000, by = 500),
        y = c(0.2084272, 0.207078125, 0.2054202, 0.203488075, 0.2013152,
              0.198933825, 0.196375, 0.193668575, 0.1908432, 0.187926325,
              0.1849442, 0.181921875, 0.1788832, 0.175850825, 0.1728462,
              0.169889575, 0.167),
        y.err = c(0.002067834, 0.001037248, 0.001959138, 0.000328942,
                  0.000646088, 0.001375657, 0.000908696, 0.00014721,
                  0.000526976, 0.001217318, 0.000556495, 0.000401883,
                  0.001446992, 0.001235017, 0.001676249, 0.001011735,
                  0.000326678)
)

df

#  fit the regression model
model <- lm(y~x, data = df)
summary(model)

#  get the pooled variance of the measurement error
#  thanks to: http://schools-wikipedia.org/wp/v/Variance.htm
#  (law of total variance: the measurement errors have mean zero, so the
#   pooled variance is just the mean of the per-point variances)
pooled.var.y.err <- mean(df$y.err^2)

# get variance of beta from model
#   thanks to: http://stats.stackexchange.com/questions/44838/how-are-the-standard-errors-of-coefficients-calculated-in-a-regression
X <- cbind(1, df$x)
#     (if you add more variables to the model you need to modify the following line)
var.betaHat <- anova(model)[[3]][2] * solve(t(X) %*% X) 

# add betaHat variance to measurement error variance
var.betaHat.adj <- var.betaHat + pooled.var.y.err

# calculate adjusted standard errors 
sqrt(diag(var.betaHat.adj))

# compare to un-adjusted standard errors
sqrt(diag(var.betaHat))
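
As an aside: if the goal is simply to have the fit take the per-point uncertainties into account (the way Origin or Igor would), the usual approach is weighted least squares, passing inverse-variance weights via lm()'s standard `weights` argument. This is a sketch under the same assumption as above, that y.err is a standard deviation; it weights each point rather than adjusting standard errors after the fact, so it is a different (and more conventional) treatment than the one in my code above.

```r
# Weighted least squares: each point is weighted by 1 / variance, so
# precise measurements (small y.err) pull on the fit harder than noisy ones.
df <- data.frame(
        x = seq(2000, 10000, by = 500),
        y = c(0.2084272, 0.207078125, 0.2054202, 0.203488075, 0.2013152,
              0.198933825, 0.196375, 0.193668575, 0.1908432, 0.187926325,
              0.1849442, 0.181921875, 0.1788832, 0.175850825, 0.1728462,
              0.169889575, 0.167),
        y.err = c(0.002067834, 0.001037248, 0.001959138, 0.000328942,
                  0.000646088, 0.001375657, 0.000908696, 0.00014721,
                  0.000526976, 0.001217318, 0.000556495, 0.000401883,
                  0.001446992, 0.001235017, 0.001676249, 0.001011735,
                  0.000326678)
)

# straight line, as in the simplified model above
wls.line <- lm(y ~ x, data = df, weights = 1 / y.err^2)
summary(wls.line)

# or the quartic from the original question
wls.poly <- lm(y ~ poly(x, 4, raw = TRUE), data = df, weights = 1 / y.err^2)
summary(wls.poly)
```

The coefficient standard errors reported by summary() then already reflect the stated uncertainties, with no manual adjustment needed.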