Calculate 95 % credible interval for multiple columns in data frame R

Question

I need to get/calculate the 95 % credible interval for my data. My data consists of ten columns and over 5000 rows. Here is some example data.

data <- data.frame(A = c(-7.595932, -6.451768, -4.682111, -8.781488, -4.251690), 
                   B = c(0.8324450, 0.9451657, 0.8773759, 0.6044753, 0.6553995),
                   C = c(22.747480, 15.477470, 18.745407, 9.622865, 21.137619), 
                   D = c(-11.684762, -13.474299, -9.783277, -7.747501, -12.352081))

I am just not sure which function to use since I get different results each time and it only works with one column at a time. I have tried the following functions:

ci(data$`A`, confidence = 0.95)  ## R package gmodels

and

CI(data$`A`, confidence = 0.95) ##R package Rmisc

Have anyone else experienced the same problem?

Can you please provide a reproducible example? The functions `ci()` and `CI()` are not in base R, so I get an error message. — sjp, Dec 17 '20 at 10:37
@sjp , ci() is with package gmodels and CI() is with package Rmisc. I am not sure what you mean with reproducible example? Still quite new to R:( — paula456, Dec 17 '20 at 10:49

score 2 · Answer 1 · edited Dec 17 '20 at 11:27

2

The two functions give you actually the same thing:

library(gmodels)
library(Rmisc)
 CI(data$A)
    upper      mean     lower 
-3.975568 -6.352598 -8.729627 
 ci(data$A, confidence = 0.95)
  Estimate   CI lower   CI upper Std. Error 
-6.3525978 -8.7296274 -3.9755682  0.8561414

To apply it on all columns, use lapply or sapply:

> sapply(data,CI)
              A         B        C          D
upper -3.975568 0.9648266 24.01143  -8.198957
mean  -6.352598 0.7829723 17.54617 -11.008384
lower -8.729627 0.6011180 11.08091 -13.817811

edited Dec 17 '20 at 11:27

jay.sf

60,139
8
53
110

answered Dec 17 '20 at 10:57

denis

5,580
1
13
40

I must have done something wrong then. But how can I get credible interval for all columns, not just one (in this case A) @denis – paula456 Dec 17 '20 at 10:59

score 1 · Answer 2 · answered Dec 17 '20 at 10:55

It's not clear if this is what you are looking for, but you can get a print-out of the mean of each variable with the 95% confidence interval for the mean like this:

lapply(data, function(x) {
   paste0(round(mean(x), 2), " (95% CI: ",
   paste(round(sort(mean(x) + c(1.96, -1.96) * sd(x)/sqrt(length(x))), 2),
         collapse = " to "), ")")
 } )

#> $A
#> [1] "-6.35 (95% CI: -8.03 to -4.67)"
#>
#> $B
#> [1] "0.78 (95% CI: 0.65 to 0.91)"
#>
#> $C
#> [1] "17.55 (95% CI: 12.98 to 22.11)"
#>
#> $D
#> [1] "-11.01 (95% CI: -12.99 to -9.03)"

score 1 · Accepted Answer · answered Dec 17 '20 at 10:58

If you want a credible interval (from Bayesian statistics) this requires some additional tuning, choice of prior and likelihood. There are some defaults already in some functions, so you may get away with it, but you should really know what you are doing, before blindly applying such concepts. Here is an example for demonstration purposes.

library(bayestestR)

data <- data.frame(A = c(-7.595932, -6.451768, -4.682111, -8.781488, -4.251690), 
                   B = c(0.8324450, 0.9451657, 0.8773759, 0.6044753, 0.6553995),
                   C = c(22.747480, 15.477470, 18.745407, 9.622865, 21.137619), 
                   D = c(-11.684762, -13.474299, -9.783277, -7.747501, -12.352081))

sapply(data,ci,ci=0.95)

        A         B         C        D        
CI      95        95        95       95       
CI_low  -8.662932 0.6095677 10.20833 -13.36208
CI_high -4.294732 0.9383867 22.58649 -7.951079

The data I have is my posterior distribution, transferred from a stanfit object to a data frame. My job here is to decide with parameters (A, B, C, etc) I would like to keep and which I would like to remove from the model. I would like to remove parameters whose 95% credible interval include 0. — paula456, Dec 17 '20 at 13:27
@paula456 If you have the posterior than it's ok, it's just a matter of finding the borders. Your second point, however, about removing variables which are not "significant", is not recommended, because it is a form of overfitting. This is more appropriate for cross validated. — user2974951, Dec 17 '20 at 13:37

Calculate 95 % credible interval for multiple columns in data frame R

3 Answers3