
I have a dataframe of kidney transplant patients with different clinical outcomes (numbers changed for confidentiality purposes). In other words, I have something like this:

Patient  eGFR1m  cr1m  alb1m  cr3m  eGFR3m  alb3m  cr12m  eGFR12m  Diseased
A        142     343   125    110   115     125    120    181      1
B        175     192   121    125   215     120    135    151      0
C        154     185   128    210   115     125    124    116      0
D        170     215   215    110   125     110    145    205      1
E        175     140   225    110   115     110    125    120      0

This is the simplified version. I have a lot more outcomes, so I want to create a loop for calculating the median and IQR for each column in R.

Another thing is that I need the medians for the whole cohort, as well as the medians for the diseased and non-diseased groups for comparison. The disease outcome was collected as a binary, non-continuous variable. eGFR, cr and alb at each month are all continuous, non-parametric variables.

MrFlick
han
  • Just to confirm. Do you want the median of all patients (median of all samples). Or the median per patient. (median of a,b,c per patient) – Lex Sep 08 '20 at 03:40
  • SO is not a free coding service; post your data using `dput` so we do not have to type it all in. Then share the essence of the code you have already worked on and raise the specific issues you have. See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example.) on how to make a reproducible example. Google, for instance, `R data analysis with dplyr` and you will find many useful examples that would help you with your problem. – Paul van Oppen Sep 08 '20 at 03:42
  • Please ask only one question per post. Also include the code that you have tried and where you got stuck or it didn't work. – Ronak Shah Sep 08 '20 at 04:05

2 Answers


It seems you want us to do all the steps of an initial exploratory data analysis for you. In your next posts, instead of requesting code like this, you should first show your problem with reproducible code, show the results of your attempts, and ask specific questions about your doubts. That said, let's look at your question:

You can use `sapply` to loop over the columns and return the median, mean, Q1 and Q3 for every column. Apply it to the numeric columns only, since `quantile` fails on character or factor columns such as a patient ID.

sapply(yourdataframe, median) #will return a vector with the medians of every column

Similarly,

sapply(yourdataframe, quantile, 0.25) #will return a vector with all the first quartiles
sapply(yourdataframe, quantile, 0.75) #will return a vector with all the third quartiles
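The calls above can be combined into one by passing several probabilities to `quantile`. The following is only a minimal sketch: it uses the built-in `iris` data as a stand-in for your dataframe, and the names `num_cols` and `quartiles` are made up for illustration.

```r
# Keep only the numeric columns; quantile() stops with an error on factors
num_cols <- iris[sapply(iris, is.numeric)]

# One call returns Q1, median and Q3 for every column
# (rows are named "25%", "50%", "75%"; columns keep the variable names)
quartiles <- sapply(num_cols, quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
quartiles
```

Each extra argument after the function name (`probs`, `na.rm`) is passed on by `sapply` to `quantile`, so the same pattern works with `median`, `IQR`, or `mean`.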

You may want to write a function that integrates all that in a single call, like this:


descriptive<-function(x=data.frame(), digits=2, na.rm=TRUE, normality_test="shapiro"){
        library(stats)
        is.normal<-character()
        medians<-numeric()
        Q1<-numeric()
        Q3<-numeric()
        means<-numeric()
        SDs<-numeric()
        output<-character()
        for (i in seq_along(x)){
                if (is.numeric(x[[i]])){
                        # x[[i]] always returns a plain vector, also for tibbles
                        medians[i]<-median(x[[i]], na.rm = na.rm)
                        Q1[i]<-quantile(x[[i]], 0.25, na.rm = na.rm)
                        Q3[i]<-quantile(x[[i]], 0.75, na.rm = na.rm)
                        means[i]<-round(mean(x[[i]], na.rm = na.rm), digits = digits)
                        SDs[i]<-round(sd(x[[i]], na.rm = na.rm), digits = digits)
                        if (normality_test=="shapiro"){
                                p.value<-shapiro.test(x[[i]])$p.value
                        } else if (normality_test=="ks"){
                                p.value<-ks.test(x[[i]], "pnorm", means[i], SDs[i])$p.value
                        } else {
                                stop("normality_test must be \"shapiro\" or \"ks\"")
                        }
                        if (p.value<=0.05){
                                is.normal[i]<-FALSE
                                output[i]<-paste0(medians[i], " (", Q1[i], "-", Q3[i], ")")
                        }else{
                                is.normal[i]<-TRUE
                                output[i]<-paste0(means[i], " +-", SDs[i])
                        }
                }else  {
                        is.normal[i]<-NA
                        means[i]<-NA
                        medians[i]<-NA
                        Q1[i]<-NA
                        Q3[i]<-NA
                        SDs[i]<-NA
                        output[i]<-NA
                }
        }      
        
        df<-data.frame(rbind( "normal distr"=is.normal, "median"=medians, "Q1"=Q1, "Q3"=Q3, "mean"=means, "SD"=SDs, "output"=output))
        names(df)<-colnames(x)
        df
}

As an example:

> descriptive(iris, normality_test="shapiro")
              Sepal.Length Sepal.Width   Petal.Length   Petal.Width Species
normal distr         FALSE        TRUE          FALSE         FALSE    <NA>
median                 5.8           3           4.35           1.3    <NA>
Q1                     5.1         2.8            1.6           0.3    <NA>
Q3                     6.4         3.3            5.1           1.8    <NA>
mean                  5.84        3.06           3.76           1.2    <NA>
SD                    0.83        0.44           1.77          0.76    <NA>
output       5.8 (5.1-6.4) 3.06 +-0.44 4.35 (1.6-5.1) 1.3 (0.3-1.8)    <NA>

There are several ways to subset your data by a categorical variable before analysis; see dplyr's `filter` and `group_by` functions.
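As a rough illustration of the group comparison asked for in the question, here is a base-R sketch that uses `split` instead of dplyr, so it runs without extra packages. It assumes a data frame shaped like the one in the question, with only three outcome columns kept for brevity; the names `kidney`, `outcomes`, `med_iqr`, `cohort` and `by_group` are hypothetical.

```r
# Toy data copied from the question (three outcome columns only)
kidney <- data.frame(
  Patient  = c("A", "B", "C", "D", "E"),
  eGFR1m   = c(142, 175, 154, 170, 175),
  cr1m     = c(343, 192, 185, 215, 140),
  alb1m    = c(125, 121, 128, 215, 225),
  Diseased = c(1, 0, 0, 1, 0)
)

# Every column except the ID and the grouping variable is an outcome
outcomes <- setdiff(names(kidney), c("Patient", "Diseased"))

# Format one variable as "median (Q1-Q3)"
med_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  sprintf("%g (%g-%g)", median(x, na.rm = TRUE), q[[1]], q[[2]])
}

# Whole cohort, then one column per level of Diseased ("0" and "1")
cohort   <- sapply(kidney[outcomes], med_iqr)
by_group <- sapply(split(kidney[outcomes], kidney$Diseased),
                   function(g) sapply(g, med_iqr))

cbind(cohort, by_group)
```

The result is a character matrix with one row per outcome and the columns `cohort`, `0` and `1`, which extends to any number of outcome columns without further changes.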

GuedesBF
  • Ah, my bad. I'm very new to coding and this forum. Thank you for your help! I'm such a noob, didn't even know sapply could be used so i've been literally coding median(dataframe$variable) and copying pasting for each variable. Yeah, i know repetition is bad in coding. I'll try using the function. – han Sep 08 '20 at 08:00
  • Hi, @han , glad to help. You can ACCEPT an answer if you feel it resolved your issues, with the "accept answer button" (upper left corner of the answer). Accepting and upvoting answers is usually better than thanking in the comment section. – GuedesBF Sep 09 '20 at 14:24

Try the following code. Note that I have not considered the last column (Diseased), since a median and IQR would not make sense for a discrete variable.

# creating your data

data = matrix (c(142,343,125,110,115,125,120,181,1,
  175,192,121,125,215,120,135,151,0,
  154,185,128,210,115,125,124,116,0,  
  170,215,215,110,125,110,145,205,1, 
  175,140,225,110,115,110,125,120,0), ncol=9, byrow = TRUE)

colnames(data) <- c('eGFR1m', 'cr1m' , 'alb1m'  ,'cr3m' ,  'eGFR3m' ,  'alb3m' , 'cr12m' ,'eGFR12m',   'Diseased')
rownames(data) <- LETTERS[1: nrow(data)]

# IQR and median for each column

apply(data[, -ncol(data)], 2, function(x){
  # return both statistics as a named vector, one column per variable
  c(Median = median(x, na.rm = TRUE), IQR = IQR(x, na.rm = TRUE))
})

Liman