Add two columns to a dataframe based on what value appears first

Question

I have a dataframe, df :

df <- structure(list(ID = c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6"
), val1 = c(0, 1, 0, 0, 0, 0), sig1 = c(0.76, 0.15, 0.35, 0.66, 
0.7, 0.72), val2 = c(0.25, 0.81, 0.21, 0.06, 0.18, NA), sig2 = c(0.65, 
0.09, 0.55, 0.88, 0.09, NA), val3 = c(0.26, 0.24, 0.16, -0.1, 
0.22, NA), sig3 = c(0.64, 0.62, 0.65, 0.8, 0.04, NA)), row.names = c(NA, 
6L), class = "data.frame")

> df
   ID val1 sig1 val2 sig2  val3 sig3
1 ID1    0 0.76 0.25 0.65  0.26 0.64
2 ID2    1 0.15 0.81 0.09  0.24 0.62
3 ID3    0 0.35 0.21 0.55  0.16 0.65
4 ID4    0 0.66 0.06 0.88 -0.10 0.80
5 ID5    0 0.70 0.18 0.09  0.22 0.04
6 ID6    0 0.72   NA   NA    NA   NA

This data frame contains 3 pairs of data for each ID number - val1 and sig1, val2 and sig2 and val3 and sig3.

I want to add an additional 2 columns. I want the first column (first_val) to state the name of the first "val" column where the associated "sig" column is less than 0.5. In the second column (first_val_value), I want to paste the associated val1 or val2 value for that ID.

So for example -

For ID1, none of the sig values are less than 0.5, so the first_val column entry should be "-".

For ID2, sig1 is less than 0.5, so the first_val column entry should be "val1".

For ID3, sig1 is less than 0.5, so the first_val column entry should be "val1".

For ID4, none of the sig values are less than 0.5, so the first_val column entry should be "-".

For ID5, sig2 is less than 0.5, so the first_val column entry should be "val2".

For ID6, none of the sig values are less than 0.5, so the first_val column entry should be "-".

So my desired dataframe would look like this:

> df
   ID val1 sig1 val2 sig2  val3 sig3 first_val first_val_value
1 ID1    0 0.76 0.25 0.65  0.26 0.64         -               -
2 ID2    1 0.15 0.81 0.09  0.24 0.62      val1               1
3 ID3    0 0.35 0.21 0.55  0.16 0.65      val1               0
4 ID4    0 0.66 0.06 0.88 -0.10 0.80         -               -
5 ID5    0 0.70 0.18 0.09  0.22 0.04      val2            0.18
6 ID6    0 0.72   NA   NA    NA   NA         -               -

What is the best way to construct this dataframe?

score 1 · Accepted Answer · answered Aug 21 '20 at 11:20

Get the data in long format, filter the values less than 0.5, keep only 1st row for each ID and join the data to get the original data back.

library(dplyr)
df %>%
  tidyr::pivot_longer(cols = -ID, 
               names_to = c('.value', 'first_val'), 
               names_pattern = '([a-z]+)(\\d+)') %>%
  filter(sig < 0.5) %>%
  group_by(ID) %>%
  slice(1L) %>%
  mutate(first_val  = paste0('val', first_val)) %>%
  select(-sig) %>%
  right_join(df, by = 'ID')  %>%
  arrange(ID)
  

#  ID    first_val   val  val1  sig1  val2  sig2  val3  sig3
#  <chr> <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 ID1   NA        NA        0  0.76  0.25  0.65  0.26  0.64
#2 ID2   val1       1        1  0.15  0.81  0.09  0.24  0.62
#3 ID3   val1       0        0  0.35  0.21  0.55  0.16  0.65
#4 ID4   NA        NA        0  0.66  0.06  0.88 -0.1   0.8 
#5 ID5   val2       0.18     0  0.7   0.18  0.09  0.22  0.04
#6 ID6   NA        NA        0  0.72 NA    NA    NA    NA

Add two columns to a dataframe based on what value appears first

So for example -

1 Answers1