R - data cleaning and mutate errors

Question

I have reviewed "Error: Problem with mutate() column (...) must be size 15 or 1, not 17192", "How to drop columns with column names that contain specific string?", "Remove columns that contain a specific word", and the associated error troubleshooting.

I have a large dataset with viral data for different species in different areas. Sample data is below:

Country    ..2  Area    Site    ID      Species Sample    Original Sample/Specimen #
<chr>     <lgl> <chr>   <chr>   <chr>   <chr>   <chr>    <chr>
Tanzania    NA  UMNP    UMNPhq  AATPH   PG     Feces    AATPHF2 
Tanzania    NA  UMNP    UMNPhq  AATPI   PG     Feces    AATPIF2 
Tanzania    NA  UMNP    UMNPhq  AATPJ   PG     Feces    AATPJF2 
Tanzania    NA  UMNP    UMNPhq  ATTPK   PG     Feces    ATTPKF2 
Tanzania    NA  UMNP    UMNPhq  AATPL   PG     Feces    AATPLF2 

Filovirus (MOD) PCR  Date (Filo MOD)
<chr>                <date>
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Negative             2015-03-16

I am trying to recode viral status as positive or negative for every sample ID (just filovirus here, but there are a lot of them, so please help me code this more generally).

Code I've tried - first subsetting data to only include a specific area

viral <- subset(data, Area %in% "UMNP")

Here I got rid of unwanted columns and was then able to get the infection status, but it converted all other information on the sample to "NA", causing additional errors when I try to keep those values.

viralres <- viral %>% 
     dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
    mutate_if(is.character, ~case_when(. == "Indeterminant" ~ "0", 
                                       . == "Negative" ~ "0", 
                                       . == "Positive" ~ "1"))

Dput

structure(list(Country = c("Tanzania", "Tanzania", "Tanzania", 
"Tanzania", "Tanzania"), ...2 = c(NA, NA, NA, NA, NA), Area = c("UMNP", 
"UMNP", "UMNP", "UMNP", "UMNP"), Site = c("UMNPhq", "UMNPhq", 
"UMNPhq", "UMNPhq", "UMNPhq"), `Animal ID` = c("AATPH", "AATPI", 
"AATPJ", "ATTPK", "AATPL"), Species = c("Procolobus gordonorum", 
"Procolobus gordonorum", "Procolobus gordonorum", "Procolobus gordonorum", 
"Procolobus gordonorum"), `Sample Type` = c("Feces", "Feces", 
"Feces", "Feces", "Feces"), `Original Sample/Specimen #` = c("AATPHF2", 
"AATPIF2", "AATPJF2", "ATTPKF2", "AATPLF2"), `Filovirus (MOD) PCR` = c("Indeterminant", 
"Indeterminant", "Indeterminant", "Indeterminant", "Negative"
), `Date (Filo MOD)` = structure(c(16510, 16510, 16510, 16510, 
16510), class = "Date")), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

score 0 · Accepted Answer · answered Mar 02 '23 at 16:14

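Using mutate_if(is.character, ...) will change all of your character columns. Because case_when() returns NA for any value that matches none of its conditions, columns such as Country and Site become NA. It looks like the only column you are trying to change is "Filovirus (MOD) PCR", so you could change the command to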

viral %>% 
  dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
  mutate(across(`Filovirus (MOD) PCR`, ~case_when(. == "Indeterminant" ~ "0", 
                                     . == "Negative" ~ "0", 
                                     . == "Positive" ~ "1")))

for the least amount of change. That way you are only changing that column. Alternatively, you could mutate that single column more directly using case_match:

viral %>% 
  dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
  mutate(`Filovirus (MOD) PCR` = case_match(`Filovirus (MOD) PCR`,"Indeterminant" ~ "0", 
                                     "Negative" ~ "0", 
                                     "Positive" ~ "1"))

Note that case_match was introduced in dplyr 1.1.0.
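
If the full dataset has many result columns, the same recoding can be applied to all of them in one across() call. The following is only a minimal sketch: it assumes every result column name ends in "PCR" (as `Filovirus (MOD) PCR` does), and the toy tibble `viral_demo` and its `Corona PCR` column are made up for illustration. It also needs dplyr 1.1.0 or later because of case_match.

```r
library(dplyr)

# Toy data: `Corona PCR` is a hypothetical second result column
viral_demo <- tibble(
  `Animal ID` = c("AATPH", "AATPI", "AATPJ"),
  `Filovirus (MOD) PCR` = c("Indeterminant", "Negative", "Positive"),
  `Corona PCR` = c("Negative", "Positive", NA)
)

# Recode only the columns whose names end in "PCR"; other columns are untouched
viral_recoded <- viral_demo %>%
  mutate(across(ends_with("PCR"),
                ~ case_match(.x,
                             c("Indeterminant", "Negative") ~ "0",
                             "Positive" ~ "1")))
```

If the real column names do not share a suffix, a different selection helper or an explicit list of names would be needed in place of ends_with("PCR").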

MrFlick

This is just sample code, so yes, here it is only one column. However, in the larger dataset, it is many columns. – Marnee Roundtree Mar 02 '23 at 16:37
If it's many columns, then you can add them all to the `across()` call. You can either list them explicitly or come up with a more selective pattern than just `is.character` to identify them. – MrFlick Mar 02 '23 at 16:40
```Error in `mutate()`: ℹ In argument: `~...`. Caused by error: ! `~...` must be a vector, not a <formula> object.``` – Marnee Roundtree Mar 02 '23 at 16:56
@MarneeRoundtree Did you get that error with the test data you provided? If I copy/paste into R it works fine for me. What version of `dplyr` are you using? – MrFlick Mar 02 '23 at 16:59
I think I'm using dplyr 1.1.0; I thought I had just updated my tidyverse. – Marnee Roundtree Mar 02 '23 at 17:05

score 0 · Answer 2 · answered Mar 02 '23 at 16:24

Use mutate_at instead of mutate_if.

viralres <- viral %>% 
     dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
     mutate_at(c("Filovirus (MOD) PCR"), ~case_when(. == "Indeterminant" ~ "0",
                                                    . == "Negative" ~ "0", 
                                                    . == "Positive" ~ "1"))

In the .vars argument of mutate_at (the first one after the piped data), list all the result columns you want to recode (Filovirus, etc.) in a character vector.
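
As a rough sketch of listing the columns in a vector, the names can be stored once and passed through all_of(). This version uses across() rather than mutate_at, which newer dplyr versions have superseded. The toy tibble `viral_demo`, its values, and the name `pcr_cols` are hypothetical; the second column name is taken from the error message quoted in the comments below.

```r
library(dplyr)

# Toy data; the values are invented for illustration
viral_demo <- tibble(
  Country = c("Tanzania", "Tanzania"),
  `Filovirus (MOD) PCR` = c("Indeterminant", "Positive"),
  `Phlebo (Sanchez-Seco) PCR` = c("Negative", "Negative")
)

# Names of the columns to recode, as a plain character vector
pcr_cols <- c("Filovirus (MOD) PCR", "Phlebo (Sanchez-Seco) PCR")

# all_of() selects exactly these columns and errors if one is missing
viral_recoded <- viral_demo %>%
  mutate(across(all_of(pcr_cols),
                ~ case_when(.x %in% c("Indeterminant", "Negative") ~ "0",
                            .x == "Positive" ~ "1")))
```

Columns that are not in `pcr_cols`, such as Country, keep their original values.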

Alexis van STRAATEN

```Error in `mutate_at()`: ! `.vars` must be a character/numeric vector or a `vars()` object, not a <formula> object.``` – Marnee Roundtree Mar 02 '23 at 16:36
Note that `mutate_at` has been superseded in newer versions of dplyr. The help page encourages the use of `across()` instead. – MrFlick Mar 02 '23 at 16:44
@MrFlick Yeah, good one! – Alexis van STRAATEN Mar 02 '23 at 16:49
@MarneeRoundtree Strange, because c() isn't a formula object... Replace the 'c' with 'vars' in the `mutate_at()` call. – Alexis van STRAATEN Mar 02 '23 at 16:50
```Warning: NAs introduced by coercion Error in "Filovirus (MOD) PCR":"Phlebo (Sanchez-Seco) PCR" : NA/NaN argument``` – Marnee Roundtree Mar 02 '23 at 16:59
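
The last error looks like what base R produces when `:` is applied to two quoted strings: it tries to coerce both to numbers and fails. Assuming that is what happened here, one possible fix is to write the range with backticked column names inside across(), where `:` selects a run of adjacent columns. This is only a sketch; the toy tibble `viral_demo` and its `Corona PCR` column are hypothetical.

```r
library(dplyr)

# Toy data; `Corona PCR` is a made-up column sitting between the two real names
viral_demo <- tibble(
  `Animal ID` = c("AATPH", "AATPI"),
  `Filovirus (MOD) PCR` = c("Indeterminant", "Positive"),
  `Corona PCR` = c("Negative", "Positive"),
  `Phlebo (Sanchez-Seco) PCR` = c("Negative", "Negative")
)

# Base R cannot build a sequence from two strings, so this raises an error;
# the error message is captured here instead of stopping the script
range_error <- tryCatch(
  suppressWarnings("Filovirus (MOD) PCR":"Phlebo (Sanchez-Seco) PCR"),
  error = function(e) conditionMessage(e)
)

# Backticked names let the selection treat `:` as a range of adjacent columns
viral_recoded <- viral_demo %>%
  mutate(across(`Filovirus (MOD) PCR`:`Phlebo (Sanchez-Seco) PCR`,
                ~ case_when(.x %in% c("Indeterminant", "Negative") ~ "0",
                            .x == "Positive" ~ "1")))
```

A range like this depends on column order, so it only fits when every column between the two endpoints should be recoded.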
