1

I have a dataset that includes some non-referenced data that I would like to replace with NA. In the following example, if the data in columns rep1 to rep4 does not match one of the values in the ID column, I would like to replace the value with NA. In this case, the values of x, y, and z aren't listed in the ID column, so they should be replaced.

This is a somewhat similar question that I asked earlier here : If data present, replace with data from another column based on row ID

I think the solution will be similar to what was given in the previous question, but I don't know how to alter the second portion ~ value[match(., ID)] to return NA for values that aren't listed in the ID column.

df %>% mutate_at(vars(rep1:rep4), ~ value[match(., ID)])

ID  rep1  rep2  rep3  rep4  
a                           
b   a                       
c   a     b                 
d   a     b     c           
e   a     b     c     d     
f                           
g   x                       
h                           
i                           
j   y     z                 
k   z                       
l                           
m                           

The result should look like this:

ID  rep1  rep2  rep3  rep4  
a                           
b   a                       
c   a     b                 
d   a     b     c           
e   a     b     c     d     
f                           
g   NA                      
h                           
i                           
j   NA    NA                    
k   NA                      
l                           
m                           

Here is the data using dput()

structure(list(ID = structure(1:13, .Label = c("a", "b", "c", 
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m"), class = "factor"), 
    rep1 = structure(c(1L, 2L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 4L, 
    5L, 1L, 1L), .Label = c("", "a", "x", "y", "z"), class = "factor"), 
    rep2 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 
    1L, 1L, 1L), .Label = c("", "b", "z"), class = "factor"), 
    rep3 = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("", "c"), class = "factor"), rep4 = structure(c(1L, 
    1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "d"), class = "factor")), class = "data.frame", row.names = c(NA, -13L))
Corey
  • 405
  • 2
  • 6
  • 18

2 Answers2

3

An dplyralternative using replace()

df %>%
  mutate_at(vars(rep1:rep4), ~replace(., which(!(. %in% ID | . == "")), NA))

   ID rep1 rep2 rep3 rep4
1   a                    
2   b    a               
3   c    a    b          
4   d    a    b    c     
5   e    a    b    c    d
6   f                    
7   g <NA>               
8   h                    
9   i                    
10  j <NA> <NA>          
11  k <NA>               
12  l                    
13  m 

Humpelstielzchen
  • 6,126
  • 3
  • 14
  • 34
  • Thanks for your help. Out of curiosity, if the ID column doesn't have any missing values, is the second part of the function df == '' necessary? – Corey Oct 11 '19 at 14:26
  • 1
    This part is always necessary, because otherwise your blank fields will get an `NA`, since blank doesn't appear in the `ID`-column. And your desired output shows `NA`only for those fields that don't appear in `ID` AND are not blank. – Humpelstielzchen Oct 11 '19 at 17:49
2

We can use sapply and replace values to NA if they are not present in ID or have blank value.

df[!(sapply(df, `%in%`, df$ID) | df == '')] <- NA
df

#   ID rep1 rep2 rep3 rep4
#1   a                    
#2   b    a               
#3   c    a    b          
#4   d    a    b    c     
#5   e    a    b    c    d
#6   f                    
#7   g <NA>               
#8   h                    
#9   i                    
#10  j <NA> <NA>          
#11  k <NA>               
#12  l                    
#13  m                   
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks for your help. Out of curiosity, if the ID column doesn't have any missing values, is the second part of the function df == '' necessary? – Corey Oct 11 '19 at 14:25
  • 1
    @Corey If by missing value you mean empty values, no `df == ""` wouldn't be necessary. Here we use that because we want to keep them as empty values and not replace them with `NA`. – Ronak Shah Oct 11 '19 at 14:27