Presence/absence data.frame after match

Question

I have a huge data.frame that looks like this:

   Gene     Sample1     Sample2      Sample3    .....
   A         0.34        0.99          1
   B         1.3         9.4           67
   D          13         2             284
   H         456         0.11          0.22
   G          0          32            0.8             
    ............

It has 12,000 rows and 150 columns in total.

And another vector:

        Measurements    
           0.8
           0.34
           0.22
            1
           32

I simply would like to match the vector with each column of the data.frame and get a final data frame that looks like this:

 Gene     Sample1     Sample2      Sample3    .....
   A         0.34        NA           1
   B          NA         NA           NA
   D          NA         NA           NA
   H          NA         NA          0.22
   G          NA         32          0.8

NA marks values that are not in the vector.

Can you clarify the matching rule? Is it: make a value `NA` anywhere in the matrix if it does not appear in `Measurements`? A reproducible data sample would be great as well. — Calum You, Oct 01 '18 at 20:12
I don't understand how you are doing your matching here. And are you trying to do an exact match with decimal values? That's [not a great idea](https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal) in general. When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. (Data with ...'s aren't very reproducible). — MrFlick, Oct 01 '18 at 20:12

score 2 · Accepted Answer · answered Oct 01 '18 at 20:20

You can test each column against the vector with %in% inside if_else, keeping the values that match and replacing the rest with NA. Please be careful with this kind of exact comparison of floating point numbers, though. For this example it works fine:

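library(tidyverse)
# Read the whitespace-separated sample data from the question
tbl <- read_table2(
"Gene     Sample1     Sample2      Sample3
A         0.34        0.99          1
B         1.3         9.4           67
D          13         2             284
H         456         0.11          0.22
G          0          32            0.8"
)
Measurements <- c(0.8, 0.34, 0.22, 1, 32)
# In every column except Gene, keep values found in Measurements, else NA
tbl %>%
  mutate_at(vars(-Gene), ~if_else(. %in% Measurements, ., NA_real_))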
#> # A tibble: 5 x 4
#>   Gene  Sample1 Sample2 Sample3
#>   <chr>   <dbl>   <dbl>   <dbl>
#> 1 A        0.34      NA    1   
#> 2 B       NA         NA   NA   
#> 3 D       NA         NA   NA   
#> 4 H       NA         NA    0.22
#> 5 G       NA         32    0.8

But as the example below shows, exact comparison can fail for values that should be equal, because of floating point rounding:

(1.1-0.2) %in% c(0.9)
#> [1] FALSE
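
One common way around this, not part of the original answer, is to compare within a small tolerance rather than exactly. The following is only a minimal base R sketch: the helper name near_in and the default tolerance of 1e-8 are assumptions made for illustration, and the right tolerance depends on the precision of your data.

```r
# Hypothetical helper (not from any package): TRUE where an element of x
# lies within `tol` of at least one value in `table`.
near_in <- function(x, table, tol = 1e-8) {
  vapply(x, function(v) any(abs(v - table) < tol), logical(1))
}

# The comparison that fails with %in% succeeds with a tolerance
near_in(1.1 - 0.2, 0.9)
#> [1] TRUE

# Sample data copied from the question
df <- data.frame(
  Gene    = c("A", "B", "D", "H", "G"),
  Sample1 = c(0.34, 1.3, 13, 456, 0),
  Sample2 = c(0.99, 9.4, 2, 0.11, 32),
  Sample3 = c(1, 67, 284, 0.22, 0.8)
)
Measurements <- c(0.8, 0.34, 0.22, 1, 32)

# Replace non-matching values with NA in every column except Gene
sample_cols <- setdiff(names(df), "Gene")
df[sample_cols] <- lapply(df[sample_cols], function(col) {
  ifelse(near_in(col, Measurements), col, NA_real_)
})
```

The columns stay numeric with this approach. The helper loops over every value, so with a long Measurements vector and millions of cells it may be slow.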

You can deal with this by matching character vectors instead:

tbl %>%
  mutate_all(as.character) %>%
  mutate_at(vars(-Gene), ~if_else(. %in% as.character(Measurements), ., NA_character_))
#> # A tibble: 5 x 4
#>   Gene  Sample1 Sample2 Sample3
#>   <chr> <chr>   <chr>   <chr>  
#> 1 A     0.34    <NA>    1      
#> 2 B     <NA>    <NA>    <NA>   
#> 3 D     <NA>    <NA>    <NA>   
#> 4 H     <NA>    <NA>    0.22   
#> 5 G     <NA>    32      0.8

But that comes with its own set of problems, since strings that represent the same number are not necessarily identical as character values:

"0.990" %in% c(0.99)
#> [1] FALSE
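
If you do go the character route, one possible mitigation is to convert both sides to the same fixed-decimal text form before matching. This is only a sketch: the helper name key and the choice of 6 decimal places are assumptions for illustration, and two values that sit on either side of a rounding boundary can still end up with different keys.

```r
# Hypothetical helper: parse the input as a number, then format it with a
# fixed number of decimals so equal numbers produce identical strings.
key <- function(x, digits = 6) {
  sprintf(paste0("%.", digits, "f"), as.numeric(x))
}

key("0.990") %in% key(0.99)
#> [1] TRUE

key(1.1 - 0.2) %in% key(0.9)
#> [1] TRUE
```

Matching would then use key(column) %in% key(Measurements) in place of the as.character() calls.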

Created on 2018-10-01 by the reprex package (v0.2.0).
