1

I am using plyr::ddply to run a regression model

model <- rating ~ A + B + C + D + E + F

by the factor resp.id. I can create a data frame of the betas by each factor with:

indiv.betas <- ddply(data.coded, "resp.id",
                     function(df) coef(lm(model, data=df)))

I am now trying to extract the p-values for the variables by the factor using:

indiv.pvalues <- ddply(data.coded, "resp.id",
                       function(df) coef(summary(lm(model, data=df)))[, "Pr(>|t|)"])

Unfortunately, it just gives me a data frame with NaN.

Although, if I run a model across the entire data set, I can extract p-values from this one model as a data frame successfully with:

pvalue <- as.data.frame(coef(summary(lm(model, data=df)))[, "Pr(>|t|)"])

How can I create a data frame of the p-values by the factor?

Thanks.

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
RTrain3k
  • 845
  • 1
  • 13
  • 27
  • You should include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data so we can run the code to test it to see what's going on. – MrFlick Sep 16 '16 at 15:54

1 Answers1

2

When you fit a single model

rating ~ A + B + C + D + E + F

you get meaningful, non-NA result. While when you fit the same model for each subset / factor level by resp.id, you get NaN result. I am 100% sure that for some factor level, you don't have enough data to fit the above model. It would be a good idea, to first check how many data there are for each group. You can use:

N <- with(data.coded, tapply(rating, resp.id, FUN = length))

Your model has 7 coefficients (1 for intercept and 1 each for A, B, ..., F). So which(N < 7) will tell you which factor levels are producing NaN.


In this part, I will show that I am not able to reproduce your problem with iris dataset.

library(plyr)

model <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width

ddply(iris, "Species", function(df) coefficients(lm(model, data=df)))

#     Species (Intercept) Sepal.Width Petal.Length Petal.Width
#1     setosa    2.351890   0.6548350    0.2375602   0.2521257
#2 versicolor    1.895540   0.3868576    0.9083370  -0.6792238
#3  virginica    0.699883   0.3303370    0.9455356  -0.1697527

ddply(iris, "Species", function(df) coef(summary(lm(model, data=df)))[, 4])

#     Species  (Intercept)  Sepal.Width Petal.Length Petal.Width
#1     setosa 3.034183e-07 6.834434e-09 2.593594e-01    0.470987
#2 versicolor 5.112246e-04 6.488965e-02 1.666695e-06    0.125599
#3  virginica 1.961563e-01 6.439972e-02 1.074269e-13    0.395875

In this part, I will show why NaN could appear when there are more coefficients than data.

set.seed(0);
x1 <- rnorm(3); x2 <- rnorm(3); x3 <- rnorm(3)
y <- rnorm(3)

fit <- lm(y ~ x1 + x2 + x3)  ## 3 data, 4 coefficients

coef(summary(fit))
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.4217653        NaN     NaN      NaN
#x1          0.4124869        NaN     NaN      NaN
#x2          1.1489330        NaN     NaN      NaN
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
  • Regarding the last part, if you run `fit <- lm(y ~ x1 + x2) ## 3 data, 3 coefficients` you still get the `NaN`. How would you have to change it? Thank you – InesGuardans May 22 '20 at 07:26