0

I need to get/calculate the 95 % credible interval for my data. My data consists of ten columns and over 5000 rows. Here is some example data.

data <- data.frame(A = c(-7.595932, -6.451768, -4.682111, -8.781488, -4.251690), 
                   B = c(0.8324450, 0.9451657, 0.8773759, 0.6044753, 0.6553995),
                   C = c(22.747480, 15.477470, 18.745407, 9.622865, 21.137619), 
                   D = c(-11.684762, -13.474299, -9.783277, -7.747501, -12.352081))

I am just not sure which function to use since I get different results each time and it only works with one column at a time. I have tried the following functions:

ci(data$`A`, confidence = 0.95)  ## R package gmodels

and

CI(data$`A`, confidence = 0.95) ##R package Rmisc

Have anyone else experienced the same problem?

paula456
  • 101
  • 9

3 Answers3

2

The two functions give you actually the same thing:

library(gmodels)
library(Rmisc)
 CI(data$A)
    upper      mean     lower 
-3.975568 -6.352598 -8.729627 
 ci(data$A, confidence = 0.95)
  Estimate   CI lower   CI upper Std. Error 
-6.3525978 -8.7296274 -3.9755682  0.8561414 

To apply it on all columns, use lapply or sapply:

> sapply(data,CI)
              A         B        C          D
upper -3.975568 0.9648266 24.01143  -8.198957
mean  -6.352598 0.7829723 17.54617 -11.008384
lower -8.729627 0.6011180 11.08091 -13.817811
jay.sf
  • 60,139
  • 8
  • 53
  • 110
denis
  • 5,580
  • 1
  • 13
  • 40
  • I must have done something wrong then. But how can I get credible interval for all columns, not just one (in this case A) @denis – paula456 Dec 17 '20 at 10:59
1

It's not clear if this is what you are looking for, but you can get a print-out of the mean of each variable with the 95% confidence interval for the mean like this:

lapply(data, function(x) {
   paste0(round(mean(x), 2), " (95% CI: ",
   paste(round(sort(mean(x) + c(1.96, -1.96) * sd(x)/sqrt(length(x))), 2),
         collapse = " to "), ")")
 } )

#> $A
#> [1] "-6.35 (95% CI: -8.03 to -4.67)"
#>
#> $B
#> [1] "0.78 (95% CI: 0.65 to 0.91)"
#>
#> $C
#> [1] "17.55 (95% CI: 12.98 to 22.11)"
#>
#> $D
#> [1] "-11.01 (95% CI: -12.99 to -9.03)"
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
1

If you want a credible interval (from Bayesian statistics) this requires some additional tuning, choice of prior and likelihood. There are some defaults already in some functions, so you may get away with it, but you should really know what you are doing, before blindly applying such concepts. Here is an example for demonstration purposes.

library(bayestestR)

data <- data.frame(A = c(-7.595932, -6.451768, -4.682111, -8.781488, -4.251690), 
                   B = c(0.8324450, 0.9451657, 0.8773759, 0.6044753, 0.6553995),
                   C = c(22.747480, 15.477470, 18.745407, 9.622865, 21.137619), 
                   D = c(-11.684762, -13.474299, -9.783277, -7.747501, -12.352081))

sapply(data,ci,ci=0.95)

        A         B         C        D        
CI      95        95        95       95       
CI_low  -8.662932 0.6095677 10.20833 -13.36208
CI_high -4.294732 0.9383867 22.58649 -7.951079
user2974951
  • 9,535
  • 1
  • 17
  • 24
  • The data I have is my posterior distribution, transferred from a stanfit object to a data frame. My job here is to decide with parameters (A, B, C, etc) I would like to keep and which I would like to remove from the model. I would like to remove parameters whose 95% credible interval include 0. – paula456 Dec 17 '20 at 13:27
  • @paula456 If you have the posterior than it's ok, it's just a matter of finding the borders. Your second point, however, about removing variables which are not "significant", is not recommended, because it is a form of overfitting. This is more appropriate for cross validated. – user2974951 Dec 17 '20 at 13:37