3

It is a follow-up question. When I run the code given below, I get a plot with two R2 and p-value but p-value = 0. It is possibly due to very small p-value. I tried increasing no. of digits to 20 (here signif(..p.value.., digits = 4)) but it did not work. I would rather prefer to mention exact p-value or use stars e.g., if (p<0.002) star='**' else if (p>=0.002&p<0.05) star='*' else star=''. In addition, I would like to have r values listed in the plot. Have a look and please let me know which part needs to be amended. Looking forward!

UPDATE

Answer codes by @eipi10 for adding p-value work, but still seeking answers on adding correlation coefficient (r) in the ggplots.

Code:

library(dplyr) 
library(ggplot2)
library(ggpmisc)

df <- diamonds %>%
  dplyr::filter(cut%in%c("Fair","Ideal")) %>%
  dplyr::filter(clarity%in%c("I1" ,  "SI2" , "SI1" , "VS2" , "VS1",  "VVS2")) %>%
  dplyr::mutate(new_price = ifelse(cut == "Fair", 
                                   price* 0.5, 
                                   price * 1.1))

formula <- y ~ x - 1

p <- ggplot(df, aes(x,y, color=factor(cut))) 
p <- p + stat_smooth(method = "lm", formula = y ~ x-1, size = 1, level=0.95) 
p <- p + geom_point(alpha = 0.3) 
p <- p + stat_poly_eq(aes(label = paste(..rr.label..)),
                      label.x.npc = "right", label.y.npc = 0.15, formula = formula, 
                      parse = TRUE, size = 3) + 
          stat_fit_glance(method = 'lm', method.args = list(formula = formula),
                      geom = 'text', aes(label = paste("P-value = ", 
                      signif(..p.value.., digits = 4), sep = "")),label.x.npc = 'right',
                      label.y.npc = 0.35, size = 3)
print(p)

enter image description here

Community
  • 1
  • 1
Aby
  • 167
  • 1
  • 3
  • 16

1 Answers1

4

This is a large data set and you can see from the graph that the fit is almost perfect, which means the p-value for the regression is going to be miniscule. Below are regression models for each of the two levels of cut. To save space, only the key parts of the model summaries are shown:

lapply(unique(df$cut), function(g) {
  summary(lm(y ~ x - 1, df %>% filter(cut==g)))
})
cut=="Ideal"
...
Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x 1.001715   0.000269    3724   <2e-16 ***
...
Residual standard error: 0.2079 on 18291 degrees of freedom
Multiple R-squared:  0.9987,  Adjusted R-squared:  0.9987 
F-statistic: 1.387e+07 on 1 and 18291 DF,  p-value: < 2.2e-16

cut=="Fair"
...
Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
x 0.9895032  0.0004096    2416   <2e-16 ***
...
Residual standard error: 0.1033 on 1583 degrees of freedom
Multiple R-squared:  0.9997,  Adjusted R-squared:  0.9997 
F-statistic: 5.836e+06 on 1 and 1583 DF,  p-value: < 2.2e-16

Note the enormous F statistics. The p-values for such large F statistics are essentially zero.

pf(5.836e06, 1, 1583, lower=FALSE)  
[1] 0

Any F statistic above about 2,400 (for the given degrees of freedom) will give a p-value below the smallest non-zero number that R can represent.

pf(2400, 1, 1583, lower=FALSE)
[1] 1.716433e-319

By default, when R rounds a number, it doesn't return trailing zeros (try round(1.340000, 5) or signif(1.340000,5)). To get more zeros printed, you can, for example, use sprintf to format the output string. The code below formats the p-value in scientific notation. For decimal numbers replace "%1.4e" with "%1.4f". See the help for more details on format strings:

p <- ggplot(df, aes(x,y, color=cut)) + 
  stat_smooth(method = "lm", formula = y ~ x-1, size = 1, level=0.95) + 
  geom_point(alpha = 0.3) +
  stat_poly_eq(aes(label = paste(..rr.label..)),
               label.x.npc = "right", label.y.npc = 0.15, formula = formula, 
               parse=TRUE, size = 3) + 
  stat_fit_glance(method = 'lm', method.args = list(formula = formula),
                  geom='text', aes(label=paste0("P-value = ", sprintf("%1.4e", ..p.value..))),
                  label.x.npc = 'right',
                  label.y.npc = 0.4, size = 3)

enter image description here

UPDATE: To add starred p-value ranges, one option is to use ifelse statements with p-value ranges as the conditions:

p <- ggplot(df, aes(x,y, color=cut)) + 
  stat_smooth(method = "lm", formula = y ~ x-1, size = 1, level=0.95) + 
  geom_point(alpha = 0.3) +
  stat_poly_eq(aes(label = paste(..rr.label..)),
               label.x.npc = "right", label.y.npc = 0.15, formula = formula, 
               parse=TRUE, size = 3) + 
  stat_fit_glance(method = 'lm', method.args = list(formula = formula),
                  geom='text', aes(label=ifelse(..p.value..< 0.001, "p<0.001**", 
                                                ifelse(..p.value..>=0.001 & ..p.value..<0.05, "p<0.05*", "p>0.05"))),
                  label.x.npc = 'right',
                  label.y.npc = 0.4, size = 3)

enter image description here

eipi10
  • 91,525
  • 24
  • 209
  • 285
  • You can get the p-value directly on the log(10) scale if you really need it: `pf(5.836e06, 1, 1583, lower=FALSE, log.p=TRUE)/log(10)` = -2824.782 – Ben Bolker Sep 05 '16 at 19:41
  • 1
    Thanks @BenBolker. `if you really need it:`: You mean, in case I was worried about whether my p-value was really 1e-2800 instead of the stupendously greater 1e-315? Maybe this is a good time to link to [this article](http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf). – eipi10 Sep 05 '16 at 19:48
  • Thanks @eipi10 & @BenBolker. I would prefer to use stars or mention exact p-value. e.g., `if (p<0.002) star='**' else if (p>=0.002&p<0.05) star='*' else star=''`. Any thoughts on adding correlation coefficient and stars? Ben, Where/how to incorporate your log(10) scale code in @eipi10 code? – Aby Sep 06 '16 at 10:23