
I have a dataset that has an ID variable and thousands of columns of averages. A reproducible example is below. For each ID, I would like to select the column name that contains the value closest to 0.50. If there is a tie, select the lowest value. Is there an efficient way to do this (preferably using dplyr or data.table)?

df = data.frame(ID = paste("ID", 1:1000, sep = ""),
                matrix(rnorm(20000), nrow=10))

> df[1:5, 1:5]

   ID         X1          X2          X3          X4
1 ID1 -0.5532944 -1.20671805  0.75142048  0.56022595
2 ID2 -1.0083010 -0.01534611  1.53546691 -0.08762588
3 ID3 -0.1606776 -0.96947669 -0.38631278 -1.15647134
4 ID4 -0.5957471 -0.20918120 -0.05246698 -0.84235789
5 ID5  0.1569595 -0.62460245 -0.39454014  0.91089249

My goal is to have a dataframe with the ID variable and the column name that contains the value closest to 0.5, as well as the value itself.

   ID    T      P
1 ID1  X10 0.5671
2 ID2 X100 0.4999
3 ID3  X34 0.5877
4 ID4  X21 0.5055
5 ID5  X15 0.4987
Kate N
  • Related: [For each row return the column name of the largest value](https://stackoverflow.com/questions/17735859/for-each-row-return-the-column-name-of-the-largest-value/17735894#17735894). Start by grabbing the columns with least deviation from 0.5, `ci = max.col(-abs(d[ , -1] - 0.5))`. Then `data.frame(id = d[ , 1], nm = names(d[ , -1])[ci], val = d[ , -1][cbind(seq_len(nrow(d)), ci)])` – Henrik Jun 15 '20 at 20:13
  • Please, use `set.seed()` to make random numbers reproducible. Thank you. – Uwe Jun 16 '20 at 06:25
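
Henrik's comment, spelled out as a runnable sketch (assuming the OP's sample data with a seed added; note that `max.col()` breaks ties at random by default, so `ties.method = "first"` is set here to make the result deterministic):

set.seed(2020)
d <- data.frame(ID = paste("ID", 1:1000, sep = ""),
                matrix(rnorm(20000), nrow = 10))

# per row, the index (among the numeric columns) of the value closest to 0.5;
# negating the absolute deviation turns "closest to 0.5" into a max.col() problem
ci <- max.col(-abs(d[ , -1] - 0.5), ties.method = "first")

data.frame(id  = d[ , 1],
           nm  = names(d[ , -1])[ci],
           val = d[ , -1][cbind(seq_len(nrow(d)), ci)])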

4 Answers


Here is a different approach which reshapes the dataset from wide to long format using melt().

# create sample data: ID has constant length, values are rounded to 3 digits
set.seed(2020)
df = data.frame(ID = sprintf("ID%04i", 1:1000),
                matrix(round(rnorm(20000), 3), nrow=10))
target <- 0.5

library(data.table)
long <- melt(setDT(df), "ID")
long[, .SD[which.min(abs(value - target))], by = ID]
          ID variable value
   1: ID0001    X1924 0.501
   2: ID0002    X1440 0.499
   3: ID0003     X906 0.500
   4: ID0004     X180 0.503
   5: ID0005    X1757 0.498
  ---                      
 996: ID0996    X1568 0.500
 997: ID0997     X565 0.501
 998: ID0998     X613 0.502
 999: ID0999    X1344 0.500
1000: ID1000    X1018 0.501

Now, the OP has requested to select the lower value in case of ties. This can be achieved by ordering:

long[order(ID, value), .SD[which.min(abs(value - target))], by = ID]
          ID variable value
   1: ID0001    X1924 0.501
   2: ID0002    X1440 0.499
   3: ID0003     X906 0.500
   4: ID0004     X180 0.503
   5: ID0005    X1757 0.498
  ---                      
 996: ID0996    X1568 0.500
 997: ID0997     X565 0.501
 998: ID0998     X613 0.502
 999: ID0999    X1344 0.500
1000: ID1000    X1971 0.499

Note the difference in row 1000.

By chaining the data.table expressions, the statement can be written as a "one-liner":

melt(setDT(df), "ID")[order(ID, value), .SD[which.min(abs(value - target))], by = ID]

Also note that the sample dataset has been modified:

  1. set.seed() is used to ensure that the generated random numbers are reproducible.
  2. By using sprintf("ID%04i", 1:1000) instead of paste(), ID has a fixed length. This helps to maintain a consistent sort order (see the quick check after this list).
  3. The random numbers are rounded to 3 digits to make it more likely to encounter ties.
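
A quick check of the sort-order point in item 2 (illustrative values only):

# character sort puts "ID10" before "ID2"; zero-padded IDs sort as intended
sort(paste0("ID", c(1, 2, 10)))
#> [1] "ID1"  "ID10" "ID2"
sort(sprintf("ID%04i", c(1, 2, 10)))
#> [1] "ID0001" "ID0002" "ID0010"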
Uwe
  • Different chaining, arguably more readable: `setDT(df); df[, melt(.SD, "ID")][order(ID, value), .SD[which.min(abs(value - target))], by = ID]` – s_baldur Jun 16 '20 at 09:11
  • @sindri_baldur, You can remove `;df` from the chain: `setDT(df)[, melt(.SD, "ID")][order(ID, value), .SD[which.min(abs(value - target))], by = ID]`, but this is slightly slower than `melt(setDT(df), "ID")[order(ID, value), .SD[which.min(abs(value - target))], by = ID]`. – Uwe Jun 16 '20 at 09:15

Base R solution which should always select the lower value in the event of a tie:

num_cols_idx <- which(sapply(df, is.numeric))  # indices of the numeric columns (everything except ID)

# split df into one-row pieces; within each row, sort the values so that
# ties resolve to the lower value, then take the name of the value closest to 0.5
min_vec <- sapply(split(df, rownames(df)), function(x) {
  sorted_named_vec <- sort(unlist(x[num_cols_idx]))
  names(sorted_named_vec)[which.min(abs(sorted_named_vec - 0.5))]
}, simplify = TRUE)
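
The sketch above returns only the column names. To also get the ID and the value, as in the OP's target output, the same sort-then-pick idea extends naturally (a minimal sketch reusing num_cols_idx from above):

# per row: sort so ties resolve to the lower value, then take the value closest to 0.5
res <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  v <- sort(unlist(df[i, num_cols_idx]))
  k <- which.min(abs(v - 0.5))
  data.frame(ID = df$ID[i], T = names(v)[k], P = unname(v[k]))
}))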
hello_friend

I've simplified the example code to a smaller subset to spare my processor as I worked through the code:

data.frame(
    ID = df[1:5, 1],
    # names(x) indexes within the numeric subset; colnames(df) would be off by
    # one here, because it still includes the ID column
    T = apply(df[1:5, 2:5], 1, function(x) names(x)[which.min(abs(x - 0.5))]),
    P = apply(df[1:5, 2:5], 1, function(x) x[which.min(abs(x - 0.5))])
)
Daniel O
  • If there is a tie between column values will this select the lower value? – Kate N Jun 15 '20 at 19:45
  • `which.min` blindly selects the first minimum it sees, so whichever minimum is located closest to the head of the vector. The only way to trick it would be to first order the column from lowest to highest. – Daniel O Jun 15 '20 at 19:47
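
A minimal illustration of that ordering trick (hypothetical vector x):

x <- c(A = 0.6, B = 0.4)                   # both are 0.1 away from 0.5
names(x)[which.min(abs(x - 0.5))]          # "A" -- the first minimum wins
xs <- sort(x)                              # put lower values first
names(xs)[which.min(abs(xs - 0.5))]        # "B" -- now the lower value wins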

I think this is what you want. Because your play data repeats itself (the 10-row matrix is recycled to fill 1000 rows), I would check it on other, more random data. Yes, it uses the first "X" variable it finds, which is what I assume you mean by "lowest".

library(dplyr)

set.seed(2020)

df <- data.frame(ID = paste("ID", 1:1000, sep = ""),
                matrix(rnorm(20000), nrow=10))

results <- df %>%
  rowwise %>%
  summarise(ID = ID, 
            col_index = which.min(abs(c_across(X1:X2000) - 0.5)) + 1,
            whichcolumn = colnames(.[col_index]),
            value = nth(c_across(X1:X2000), which.min(abs(c_across(X1:X2000) - .5))),
            .groups = "rowwise")

results
#> # A tibble: 1,000 x 4
#> # Rowwise: 
#>    ID    col_index whichcolumn value
#>    <chr>     <dbl> <chr>       <dbl>
#>  1 ID1        1925 X1924       0.501
#>  2 ID2        1441 X1440       0.499
#>  3 ID3         907 X906        0.500
#>  4 ID4         181 X180        0.503
#>  5 ID5        1758 X1757       0.498
#>  6 ID6        1569 X1568       0.500
#>  7 ID7         566 X565        0.501
#>  8 ID8        1448 X1447       0.502
#>  9 ID9        1345 X1344       0.500
#> 10 ID10       1019 X1018       0.501
#> # … with 990 more rows
Chuck P
  • I had trouble with the c_across function. I'm not sure why but R couldn't find it. Is it part of dplyr? – Kate N Jun 16 '20 at 15:27
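
For reference, c_across() was introduced in dplyr 1.0.0, so the code above needs at least that version; a quick check:

# c_across() ships with dplyr >= 1.0.0
packageVersion("dplyr")   # should print '1.0.0' or later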