1

If I have the following data structure in my data frame df:

a  b  c  d

1  2  3  4
NA NA 1  2
NA 1  2  NA
NA NA NA 1

how can I create a variable that tells me, which of the variables is the first one to not have a missing value, such that:

a  b  c  d  var

1  2  3  4  a
NA NA 1  2  c
NA 1  2  NA b
NA NA NA 1  d

I need the code to work with variable names and not column numbers, because of the size of the dataset and changing the order of the variables.

I have tried:

df <- df %>% mutate(var = coalesce(deparse(substitute(a)), deparse(substitute(b)), deparse(substitute(c)), deparse(substitute(d))))

and

df <- df %>% mutate(var = deparse(substitute(do.call(coalesce, across(c(a, b, c, d))))))

trying to implement this approach. I got the code to extract the string of a variable name from: How to convert variable (object) name into String

Anton
  • 254
  • 1
  • 9

2 Answers2

1

You can do

df %>% mutate(var = apply(., 1, \(x) names(which(!is.na(x)))[1]))
#>    a  b  c  d var
#> 1  1  2  3  4   a
#> 2 NA NA  1  2   c
#> 3 NA  1  2 NA   b
#> 4 NA NA NA  1   d
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • How can I adapt this code if I want to change the order of the variables that are going to be checked for missing values, without changing the structure of the data frame? – Anton May 20 '22 at 09:41
  • You would change the order of the columns in `select` first. – Allan Cameron May 20 '22 at 09:55
1

We can use max.col, i.e.

names(df)[max.col(!is.na(df), ties.method = 'first')]
#[1] "a" "c" "b" "d"

Via dplyr,

library(dplyr)

df %>% 
 mutate(var = names(.)[max.col(!is.na(.), ties.method = 'first')])

   a  b  c  d var
1  1  2  3  4   a
2 NA NA  1  2   c
3 NA  1  2 NA   b
4 NA NA NA  1   d

You can specify variables

df %>% 
 mutate(var = names(.[c('a', 'b')])[max.col(!is.na(.[c('a', 'b')]), ties.method = 'first')])
   a  b  c  d var
1  1  2  3  4   a
2 NA NA  1  2   a
3 NA  1  2 NA   b
4 NA NA NA  1   a

You can also change the order of the variables via select(), i.e.

df %>% 
 select(c, d, b, a) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')])

   c  d  b  a new
1  3  4  2  1   c
2  1  2 NA NA   c
3  2 NA  1 NA   c
4 NA  1 NA NA   d

You can also call select() again to restore the original order of columns but maintain the result from previous order, i.e.

df %>% 
 select(c, d, b, a) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% 
 select(names(df), new)

   a  b  c  d new
1  1  2  3  4   c
2 NA NA  1  2   c
3 NA  1  2 NA   c
4 NA NA NA  1   d

To maintain all the variables at the end, you can join on the original data frame, i.e.

df %>% 
 select(c, d, b) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% 
 left_join(df) %>% 
 select(names(df), new)

Joining, by = c("c", "d", "b")
   a  b  c  d new
1  1  2  3  4   c
2 NA NA  1  2   c
3 NA  1  2 NA   c
4 NA NA NA  1   d
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • How can I adapt this code if I want to check only certain variables in my data frame and if I want to be able to change the order in which they are checked for missing values? – Anton May 20 '22 at 09:42
  • I edited for specifying variables. The second inquiry you made about the order I don't understand what you mean – Sotos May 20 '22 at 09:51
  • For example, if I want to check which variable is the first one to not have a missing value but not a, b, c, d but instead maybe c, b, a, d. – Anton May 20 '22 at 09:53
  • Edited my answer. Let me know – Sotos May 20 '22 at 09:55
  • This definitely works, the only issue is that I have other variables in the data frame that get lost if I take this approach. Maybe there is a way such that I have the original dataset in the end but with the additional variable ```new```? – Anton May 20 '22 at 10:02
  • See last update – Sotos May 20 '22 at 10:08
  • 1
    ```df %>% select(c, d, b) %>% mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% select(new) %>% cbind(df) ``` appears to do the trick for me, thank you! – Anton May 20 '22 at 10:21