Replacing subset of values with NA if I only know a part of each value

Question

I have a very large dataset and I want to change every value with either a "<" or ">" into an NA. I tried using the following command from the naniar package:

df %>% replace_with_na_at(.vars = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"), condition = ~.x == ">") %>% print()

The thing is, the "<" or ">" is only a part of each value, but the dataset is way too large (the dput below is barely a fraction of it) for me to specify every individual value I want to replace. How do I select every value that simply has a ">" or "<" in it and replace it with NA?

structure(list(`Analyte  Sample` = c(1, 2, 3, 4, 5, 6, 7, 8, 
9, 10, 11, 12, 13, 14), A = c("4190", "6665", "7435", "2052", 
"783", "322", "199", "90", "46", "17", "8", "3", "3", "<1↓"
), B = c("11569", "6677", "3852", "983.88", "589", "359", "203", 
"68", "33", "12", "6", "<2↓", "4", "<1↓"), C = c("20453", 
"7699", "2499", "707.98", "412", "328", "156", "88", "39", "27", 
"17", "<1↓", "<3↓", "<1↓"), D = c("7893", ">20000↑", 
"1623", "685.64", "321", "644", "112", "65", "35", "29", "9", 
"5", "<3↓", "<1↓"), E = c("320", "15444", "2049", "1065", 
"389", "365", "145", "77", "38", "16", "9", "6", "<2↓", "<2↓"
), F = c("7438", ">21999↑", "3472", "1057", "563", "401", "167", 
"89", "46", "19", "6", "<1↓", "<1↓", "<1↓"), G = c(7345, 
9001, 2473, 1138, 516, 403, 134, 81, 37, 17, 8, 6, 4, 3), H = c("9004", 
"3998", "2299", "964.88", "499", "341", "112", "88", "39", "32", 
"<29↓", "<30↓", "<31↓", "<29↓"), I = c("8434", "8700", 
"2217", "1263", "567", "352", "153", "80", "43", "18", "9", "2", 
"3", "<1↓"), J = c("7734", "6733", "2092", "1115", "637", "332", 
"155", "82", "37", "17", "10", "4", "1", "<1↓"), K = c(">3718↑", 
">3000↑", "2118", "862.13", "426", "355", "143", "78", "44", 
"22", "11", "<4↓", "<4↓", "<3↓"), L = c(6345, 7688, 2311, 
1195, 647, 366, 177, 83, 41, 20, 8, 6, 3, 2), M = c("4222", ">25587↑", 
"1846", "814.61", "422", "314", "154", "86", "41", "27", "21", 
"<2↓", "<2↓", "<3↓"), N = c("6773", "8934", "2381", "1221", 
"677", "356", "146", "89", "40", "17", "10", "5", "2", "<2↓"
), O = c(">2200↑", ">2133↑", ">2000↑", "564.5", "226", 
"476", "111", "60", "32", "36", "18", "<10↓", "<1↓", "<2↓"
)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-14L), spec = structure(list(cols = list(`Analyte  Sample` = structure(list(), class = c("collector_double", 
"collector")), A = structure(list(), class = c("collector_character", 
"collector")), B = structure(list(), class = c("collector_character", 
"collector")), C = structure(list(), class = c("collector_character", 
"collector")), D = structure(list(), class = c("collector_character", 
"collector")), E = structure(list(), class = c("collector_character", 
"collector")), F = structure(list(), class = c("collector_character", 
"collector")), G = structure(list(), class = c("collector_double", 
"collector")), H = structure(list(), class = c("collector_character", 
"collector")), I = structure(list(), class = c("collector_character", 
"collector")), J = structure(list(), class = c("collector_character", 
"collector")), K = structure(list(), class = c("collector_character", 
"collector")), L = structure(list(), class = c("collector_double", 
"collector")), M = structure(list(), class = c("collector_character", 
"collector")), N = structure(list(), class = c("collector_character", 
"collector")), O = structure(list(), class = c("collector_character", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))

slava-kohut · Accepted Answer · 2020-07-31T14:42:53.970

You can use stringr::str_detect with apply to build a logical matrix that marks every cell containing "<" or ">", then assign NA to those cells across the entire df:

df[apply(df, 2, function(x) stringr::str_detect(x, "<|>"))] <- NA

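where df is your data frame. The affected columns are still character after this step, so you can convert them to numeric afterwards if you need to, e.g., using dplyr (assign the result back to df to keep it):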

df %>% mutate_if(~!is.numeric(.), ~as.numeric(.))

Output

> df
   Analyte  Sample    A      B      C      D     E    F    G      H    I    J      K    L      M    N     O
1                1 4190  11569  20453   7893   320 7438 7345   9004 8434 7734   <NA> 6345   4222 6773  <NA>
2                2 6665   6677   7699   <NA> 15444 <NA> 9001   3998 8700 6733   <NA> 7688   <NA> 8934  <NA>
3                3 7435   3852   2499   1623  2049 3472 2473   2299 2217 2092   2118 2311   1846 2381  <NA>
4                4 2052 983.88 707.98 685.64  1065 1057 1138 964.88 1263 1115 862.13 1195 814.61 1221 564.5
5                5  783    589    412    321   389  563  516    499  567  637    426  647    422  677   226
6                6  322    359    328    644   365  401  403    341  352  332    355  366    314  356   476
7                7  199    203    156    112   145  167  134    112  153  155    143  177    154  146   111
8                8   90     68     88     65    77   89   81     88   80   82     78   83     86   89    60
9                9   46     33     39     35    38   46   37     39   43   37     44   41     41   40    32
10              10   17     12     27     29    16   19   17     32   18   17     22   20     27   17    36
11              11    8      6     17      9     9    6    8   <NA>    9   10     11    8     21   10    18
12              12    3   <NA>   <NA>      5     6 <NA>    6   <NA>    2    4   <NA>    6   <NA>    5  <NA>
13              13    3      4   <NA>   <NA>  <NA> <NA>    4   <NA>    3    1   <NA>    3   <NA>    2  <NA>
14              14 <NA>   <NA>   <NA>   <NA>  <NA> <NA>    3   <NA> <NA> <NA>   <NA>    2   <NA> <NA>  <NA>
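
For comparison, here is a minimal base-R sketch of the same idea that does the replacement and the numeric conversion in one pass, touching only the character columns. The small data frame is a hypothetical stand-in for the real data (the arrow suffixes are left out to keep the sample ASCII-only; the pattern only looks for "<" or ">", so they would be caught the same way).

```r
# Hypothetical sample mirroring the question's layout: censored readings
# such as "<1" or ">20000" sit in character columns next to numeric ones.
df <- data.frame(
  Sample = c(1, 2, 3),
  A = c("4190", ">20000", "<1"),
  G = c(7345, 9001, 3),
  stringsAsFactors = FALSE
)

# Work column by column so numeric columns are left untouched.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) {
  x[grepl("[<>]", x)] <- NA   # blank out any value containing "<" or ">"
  as.numeric(x)               # the remaining strings are plain numbers
})

str(df)
```

Setting the flagged values to NA before calling as.numeric avoids the "NAs introduced by coercion" warning that a direct conversion would raise.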

When I tried this, I received the following error. Do you know what went wrong? Error in apply(df, 2, function(x) stringr::str_detect(x, "<|>")) : dim(X) must have a positive length — Ree Nadeau, Jul 31 '20 at 15:09
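
A possible explanation for the error in the comment above, offered as a guess since the commenter's session is not shown: apply stops with this message when its first argument has no dimensions. That happens with a plain vector, and also when no data frame named df exists in the session, because df then resolves to the function stats::df. The sketch below reproduces the message with a vector; the check_input helper is hypothetical.

```r
# Hypothetical guard: confirm the object is a data frame before using apply().
check_input <- function(x) {
  if (!is.data.frame(x)) {
    stop("expected a data frame, got an object of class: ", class(x)[1])
  }
  invisible(TRUE)
}

# A plain character vector has no dim attribute.
not_a_df <- c("4190", "<1")
has_dim <- !is.null(dim(not_a_df))

# Capture the error message that apply() raises for such an input.
msg <- tryCatch(
  apply(not_a_df, 2, function(x) grepl("[<>]", x)),
  error = function(e) conditionMessage(e)
)
print(msg)

# The built-in df is a function, so it has no dimensions either.
builtin_has_dim <- !is.null(dim(stats::df))
```

Running class(df) or is.data.frame(df) before the replacement step is a quick way to confirm that the object is what you expect.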
