Deal with missing data in row-wise manipulation

Question

ID <- 1:6
math <- c("YES","NO","YES","NO",NA,NA)
history <- c(NA,NA,"NO","NO","YES",NA)

dt <- data.frame(ID, math, history)

  ID math history
1  1  YES    <NA>
2  2   NO    <NA>
3  3  YES      NO
4  4   NO      NO
5  5 <NA>     YES
6  6 <NA>    <NA>

I want to make an additional column ("pass") as follows

If a student ever had "yes" at least once: "YES" (No matter if another subject is missing data or not or so.)
If the student did not get "yes"
- If both subjects are missing data : NA
- if one of subjects is "NO" : "NO"

so, the columns would like this: (I can do this manually with this minimal example. but not with my real data)

> dt
  ID math history pass
1  1  YES    <NA>  YES
2  2   NO    <NA>   NO
3  3  YES      NO  YES
4  4   NO      NO   NO
5  5 <NA>     YES  YES
6  6 <NA>    <NA> <NA>

I tried to use

dt$pass <- ifelse(rowSums(dt[,-1]=="YES",na.rm=T)>0,"YES","NO")

this code, but it was tricky because if I put na.rm=TRUE they consider NA is "NO" (ID 6 student will be "NO")

if I put na.rm=FALSE, an only students that have both subject's data are considered.

In my data, I have really lots of columns, not only math and history.

This doesn't seem to be asking a question, in either the body or the title. — TylerH, Jun 22 '22 at 13:16

Darren Tsai · Accepted Answer · 2022-06-10T09:38:51.390

6

A simple base solution is

dt$pass <- apply(dt[-1], 1, \(x) sort(x, dec = TRUE)[1])

# > dt
#   ID math history pass
# 1  1  YES    <NA>  YES
# 2  2   NO    <NA>   NO
# 3  3  YES      NO  YES
# 4  4   NO      NO   NO
# 5  5 <NA>     YES  YES
# 6  6 <NA>    <NA> <NA>

Its dplyr equivalent is

library(dplyr)

dt %>%
  rowwise() %>%
  mutate(pass = sort(c_across(-1), dec = TRUE)[1]) %>%
  ungroup()

edited Jun 10 '22 at 09:38

answered Jun 10 '22 at 09:27

Darren Tsai

32,117
5
21
51

Mohamed Desouky · Answer 2 · 2022-06-10T11:39:52.833

4

Try this

fun <- function(x){
  if(all(is.na(x))) return(NA_character_)
  if(any(na.omit(x == "YES"))) return("YES")
  return("NO")
}

dt %>% rowwise() %>% mutate(pass = fun(c_across(-ID)))

edited Jun 10 '22 at 11:39

answered Jun 10 '22 at 09:01

Mohamed Desouky

4,340
2
4
19

score 3 · Answer 3 · answered Jun 10 '22 at 09:04

A tidyverse solution that is robust with respect to the number (and names) of subjects:

library(tidyverse)

dt %>% 
  mutate(
    pass=dt %>% 
           pivot_longer(-ID) %>% 
           group_by(ID) %>% 
           summarise(
             anyYes=sum(value == "YES", na.rm=T), 
             anyNo=sum(value == "NO", na.rm=T)
           ) %>% 
           mutate(
             pass=ifelse(
                    anyYes >= 1, 
                    "YES", 
                    ifelse(anyNo >= 1, "NO", NA)
                   )
           ) %>% 
           pull(pass)
  )
  ID math history pass
1  1  YES    <NA>  YES
2  2   NO    <NA>   NO
3  3  YES      NO  YES
4  4   NO      NO   NO
5  5 <NA>     YES  YES
6  6 <NA>    <NA> <NA>

The key is the pivoting to long format.

Carl · Answer 4 · 2022-06-11T10:13:52.933

library(tidyverse)

ID <- 1:6
math <- c("YES", "NO", "YES", "NO", NA, NA)
history <- c(NA, NA, "NO", "NO", "YES", NA)

dt <- data.frame(ID, math, history)

dt |> 
  rowwise() |> 
  mutate(pass = case_when(
    sum(c_across(-ID) == "YES", na.rm = TRUE) >= 1 ~ "YES",
    sum(c_across(-ID) == "NO", na.rm = TRUE) >= 1  ~ "No",
    TRUE                                           ~ NA_character_
  )) 

# Add ungroup() if you need to do further ungrouped processing.

#> # A tibble: 6 × 4
#>      ID math  history pass 
#>   <int> <chr> <chr>   <chr>
#> 1     1 YES   <NA>    YES  
#> 2     2 NO    <NA>    No   
#> 3     3 YES   NO      YES  
#> 4     4 NO    NO      No   
#> 5     5 <NA>  YES     YES  
#> 6     6 <NA>  <NA>    <NA>

^{Created on 2022-06-10 by the reprex package (v2.0.1)}

GKi · Answer 5 · 2022-06-10T10:50:38.940

3

Maybe rowMeans shows the behaviour you would like rowSums shoul have. In case there are only NA it returns NA.

c("NO", "YES")[1 + (rowMeans(dt[-1] == "YES", TRUE) > 0)]
#[1] "YES" "NO"  "YES" "NO"  "YES" NA

Or using your line of code:

ifelse(rowMeans(dt[-1]=="YES",na.rm=T)>0,"YES","NO")
#[1] "YES" "NO"  "YES" "NO"  "YES" NA

edited Jun 10 '22 at 10:50

answered Jun 10 '22 at 10:35

GKi

37,245
2
26
48

score 2 · Answer 6 · answered Jun 10 '22 at 08:59

2

Seems like the order is what matters:

dt$pass <- NA
dt$pass[dt$math == "NO" | dt$history=="NO"] <- "NO"
dt$pass[dt$math == "YES" | dt$history=="YES"] <- "YES"

> dt
  ID math history pass
1  1  YES    <NA>  YES
2  2   NO    <NA>   NO
3  3  YES      NO  YES
4  4   NO      NO   NO
5  5 <NA>     YES  YES
6  6 <NA>    <NA> <NA>

answered Jun 10 '22 at 08:59

tavdp

421
2
6
11

1

If there are more than 2 subjects, you need to write a bunch of lines to identify pass or not. – Darren Tsai Jun 10 '22 at 09:13
1

Yes, my answer was before the edit. I agree that the answer by @Carl is more generalizable. – tavdp Jun 10 '22 at 09:18

Deal with missing data in row-wise manipulation

6 Answers6