Calculate for all columns in R

Question

I have a data.frame as follows. For every column, I want to assign a value of '1' if the value falls in the range between 2.5 percentile (>=) [second last column] and 97.5 percentile (<=) [last column]. If not, then I want to assign a value 2. At places where there are NAs, I want to let the NAs just be.

> df
              S1    S2    S3    S4    S5    S6    2.5%  97.5%
Gene1         0.02  0.04  0.05  0.03  0.10  0.06  0.01  0.08
Gene1         0.04  0.04  0.04  0.06  0.03  0.04  0.03  0.09
Gene1        51.00 57.00  50.00 54.00 70.00 63.00 41.00 71.00
Gene1         0.46  0.35  0.28  0.41  0.26  0.29  0.21  0.45
Gene1         0.09  0.08  0.09  0.09  0.08  0.09  0.07  0.10
Gene1        46.80 44.60  48.40 45.30 40.90 46.10 36.69 49.20
Gene1           NA    NA  20.30 14.10 12.10 15.50  8.72 25.61
Gene1         1.96  1.05  1.39  1.56  1.54  1.71  1.24  2.00
Gene1         0.53  0.52  0.61  0.83  0.73  0.37  0.36  0.91
Gene1         1.05  0.55  0.85  1.30  1.14  0.64  0.61  1.39
Gene1        22.31 28.88  26.75 25.08 26.29 23.34 19.76 33.44

So in the end, I want a data.frame that looks like this.

> df_updated
              S1  S2  S3  S4  S5  S6
Gene1         1   1   1   1   2   1
Gene1         1   1   1   1   1   1
Gene1         1   1   1   1   1   1
Gene1         2   1   1   1   1   1
Gene1         1   1   1   1   1   1
Gene1         1   1   1   1   1   1
Gene1        NA  NA   1   1   1   1
Gene1         1   2   1   1   1   1
Gene1         1   1   1   1   1   1
Gene1         2   2   1   1   1   1
Gene1         1   1   1   1   1   1

I tried something like follows, but ended up with an error. I learnt from here (Error in if/while (condition) {: missing Value where TRUE/FALSE needed) that this is related to the NAs that I have in the matrix, but I am not sure how to adapt by code for it, and to get the dataframe 'df_updated' I want.

df_updated <- as.data.frame(lapply(df, function(x) if (x>=df$`2.5%` & x<=df$`97.5%`) {x==1} else {x==2}))

Error in if (x >= df$`2.5%` & x <= df$`97.5%`) { : 
  missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Any help appreciated. Thank you.

Are you sure you have a `data.frame`? That looks like it might be a `matrix` given that you have repeating row names. Are you able to `dput(df)` here so we know exactly what you are working with? — thelatemail, Jul 02 '19 at 10:20

score 4 · Answer 1 · answered Jul 02 '19 at 10:30

To test if your values are in the quantile ranges you can use ifelse inside apply like:

df <- read.table(header=TRUE, text="
x             S1    S2    S3    S4    S5    S6    x2.5  x97.5
Gene1         0.02  0.04  0.05  0.03  0.10  0.06  0.01  0.08
Gene1         0.04  0.04  0.04  0.06  0.03  0.04  0.03  0.09
Gene1        51.00 57.00  50.00 54.00 70.00 63.00 41.00 71.00
Gene1         0.46  0.35  0.28  0.41  0.26  0.29  0.21  0.45
Gene1         0.09  0.08  0.09  0.09  0.08  0.09  0.07  0.10
Gene1        46.80 44.60  48.40 45.30 40.90 46.10 36.69 49.20
Gene1           NA    NA  20.30 14.10 12.10 15.50  8.72 25.61
Gene1         1.96  1.05  1.39  1.56  1.54  1.71  1.24  2.00
Gene1         0.53  0.52  0.61  0.83  0.73  0.37  0.36  0.91
Gene1         1.05  0.55  0.85  1.30  1.14  0.64  0.61  1.39
Gene1        22.31 28.88  26.75 25.08 26.29 23.34 19.76 33.44")

t(apply(df[-1], 1, function(x) ifelse(x>=x[length(x)-1] & x<=x[length(x)], 1, 2)))[,1:6]
#      S1 S2 S3 S4 S5 S6
# [1,]  1  1  1  1  2  1
# [2,]  1  1  1  1  1  1
# [3,]  1  1  1  1  1  1
# [4,]  2  1  1  1  1  1
# [5,]  1  1  1  1  1  1
# [6,]  1  1  1  1  1  1
# [7,] NA NA  1  1  1  1
# [8,]  1  2  1  1  1  1
# [9,]  1  1  1  1  1  1
#[10,]  1  2  1  1  1  1
#[11,]  1  1  1  1  1  1

Ronak Shah · Accepted Answer · 2019-07-02T10:43:19.273

3

Don't think you'll require sapply/lapply or any kind of loop here. You could directly compare the values in the dataframe with a column. You get your expected output except for FALSE values which could be converted using replace.

cols <- c(ncol(df) - 1, ncol(df))
df_updated <- df[-cols] >= df[[cols[1]]] & df[-cols] <= df[[cols[2]]]
df_updated <- replace(df_updated, !df_updated, 2)

#      S1 S2 S3 S4 S5 S6
# [1,]  1  1  1  1  2  1
# [2,]  1  1  1  1  1  1
# [3,]  1  1  1  1  1  1
# [4,]  2  1  1  1  1  1
# [5,]  1  1  1  1  1  1
# [6,]  1  1  1  1  1  1
# [7,] NA NA  1  1  1  1
# [8,]  1  2  1  1  1  1
# [9,]  1  1  1  1  1  1
#[10,]  1  2  1  1  1  1
#[11,]  1  1  1  1  1  1

Also if you need final output as dataframe you can wrap it in data.frame() in the end.

A smart one-liner suggested by @thelatemail is

(df[-cols] < df[["2.5%"]] | df[-cols] > df[["97.5%"]]) + 1

edited Jul 02 '19 at 10:43

answered Jul 02 '19 at 10:27

Ronak Shah

377,200
20
156
213

2

Too quick for me - here's my variation of similar logic 2 minutes too slow - `vars <- paste0("S",1:6); (dat[vars] < dat[["2.5%"]] | dat[vars] > dat[["97.5%"]]) + 1` – thelatemail Jul 02 '19 at 10:30
@thelatemail that is smart. I was trying to squeeze it one line but wasn't successful. – Ronak Shah Jul 02 '19 at 10:38
I had a few attempts before I simplified ;-) Needing the 2/1 order means doing the opposite comparison `< | >` rather than `>= & <=`, but the core idea hopefully is still clear. – thelatemail Jul 02 '19 at 10:42

Calculate for all columns in R

2 Answers2