
I have reviewed "Error: Problem with mutate() column (...) must be size 15 or 1, not 17192", "How to drop columns with column names that contain specific string?", and "Remove columns that contain a specific word", along with the associated error troubleshooting.

I have a large dataset with viral data for different species in different areas. Sample data are below:

Country    ..2  Area    Site    ID      Species Sample    Original Sample/Specimen #
<chr>     <lgl> <chr>   <chr>   <chr>   <chr>   <chr>    <chr>
Tanzania    NA  UMNP    UMNPhq  AATPH   PG     Feces    AATPHF2 
Tanzania    NA  UMNP    UMNPhq  AATPI   PG     Feces    AATPIF2 
Tanzania    NA  UMNP    UMNPhq  AATPJ   PG     Feces    AATPJF2 
Tanzania    NA  UMNP    UMNPhq  ATTPK   PG     Feces    ATTPKF2 
Tanzania    NA  UMNP    UMNPhq  AATPL   PG     Feces    AATPLF2 

Filovirus (MOD) PCR  Date (Filo MOD)
<chr>                <date>
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Negative             2015-03-16

I am trying to recode viral status, positive or negative, for every sample ID (just filovirus here, but there are many more viruses, so please help me code this more generally).

Code I've tried - first subsetting data to only include a specific area

viral <- subset(data, Area %in% "UMNP")

Here I dropped the unwanted columns and was able to get the infection status, but this converted all other information on the sample to "NA", which causes additional errors when I try to keep those values.

viralres <- viral %>% 
     dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
    mutate_if(is.character, ~case_when(. == "Indeterminant" ~ "0", 
                                       . == "Negative" ~ "0", 
                                       . == "Positive" ~ "1"))

Dput

structure(list(Country = c("Tanzania", "Tanzania", "Tanzania", 
"Tanzania", "Tanzania"), ...2 = c(NA, NA, NA, NA, NA), Area = c("UMNP", 
"UMNP", "UMNP", "UMNP", "UMNP"), Site = c("UMNPhq", "UMNPhq", 
"UMNPhq", "UMNPhq", "UMNPhq"), `Animal ID` = c("AATPH", "AATPI", 
"AATPJ", "ATTPK", "AATPL"), Species = c("Procolobus gordonorum", 
"Procolobus gordonorum", "Procolobus gordonorum", "Procolobus gordonorum", 
"Procolobus gordonorum"), `Sample Type` = c("Feces", "Feces", 
"Feces", "Feces", "Feces"), `Original Sample/Specimen #` = c("AATPHF2", 
"AATPIF2", "AATPJF2", "ATTPKF2", "AATPLF2"), `Filovirus (MOD) PCR` = c("Indeterminant", 
"Indeterminant", "Indeterminant", "Indeterminant", "Negative"
), `Date (Filo MOD)` = structure(c(16510, 16510, 16510, 16510, 
16510), class = "Date")), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

2 Answers


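Using mutate_if(is.character, ...) will change all of your character columns, and because case_when returns NA for any value that matches none of the conditions, everything in those other columns becomes NA. It looks like the only column you are trying to change is "Filovirus (MOD) PCR", so you could change the command to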

viral %>% 
  dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
  mutate(across(`Filovirus (MOD) PCR`, ~case_when(. == "Indeterminant" ~ "0", 
                                     . == "Negative" ~ "0", 
                                     . == "Positive" ~ "1")))

for the least amount of change. That way you are only changing that column. Alternatively, you could mutate that single column more directly using case_match:

viral %>% 
  dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
  mutate(`Filovirus (MOD) PCR` = case_match(`Filovirus (MOD) PCR`,"Indeterminant" ~ "0", 
                                     "Negative" ~ "0", 
                                     "Positive" ~ "1"))

Note that case_match was introduced in dplyr 1.1.0
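
If there are many result columns, the same recoding can be pushed through `across()` with a tidyselect helper instead of naming each column. Here is a minimal sketch, assuming every result column name ends in "PCR" (that naming pattern, and the `Coronavirus PCR` column in the toy data, are assumptions for illustration and are not taken from the real dataset):

```r
library(dplyr)  # case_match() needs dplyr >= 1.1.0

# Toy data modelled on the question; `Coronavirus PCR` is a hypothetical column
viral <- tibble(
  `Animal ID` = c("AATPH", "AATPI", "AATPL"),
  `Sample Type` = c("Feces", "Feces", "Feces"),
  `Filovirus (MOD) PCR` = c("Indeterminant", "Indeterminant", "Negative"),
  `Coronavirus PCR` = c("Positive", "Negative", "Indeterminant")
)

viralres <- viral %>%
  # Recode only the columns whose names end in "PCR"; all others are untouched
  mutate(across(ends_with("PCR"),
                ~ case_match(.x,
                             c("Indeterminant", "Negative") ~ "0",
                             "Positive" ~ "1")))
```

Values that are not listed (including existing NA) still come out as NA in the selected columns, so adjust the selection helper if any matching column holds something other than a test result.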

MrFlick
  • This is just sample code, so yes, here it is only one column. However, in the larger dataset, it is many columns. – Marnee Roundtree Mar 02 '23 at 16:37
  • If it's many columns, then you can add them all to the `across()` call. You can either list them explicitly or come up with a more selective pattern than just `is.character` to identify them. – MrFlick Mar 02 '23 at 16:40
  • ```Error in `mutate()`: ℹ In argument: `~...`. Caused by error: ! `~...` must be a vector, not a object.``` – Marnee Roundtree Mar 02 '23 at 16:56
  • @MarneeRoundtree Did you get that error with the test data you provided? If I copy/paste into R it works fine for me. What version of `dplyr` are you using? – MrFlick Mar 02 '23 at 16:59
  • I think I'm using dplyr 1.1.0, but I thought I had just updated my tidyverse. – Marnee Roundtree Mar 02 '23 at 17:05

Use mutate_at instead of mutate_if.

viralres <- viral %>% 
     dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"),)) %>%
     mutate_at(c("Filovirus (MOD) PCR"), ~case_when(. == "Indeterminant" ~ "0",
                                                    . == "Negative" ~ "0", 
                                                    . == "Positive" ~ "1"))

In the first argument of mutate_at, list all of the result columns you want to recode (Filovirus, etc.) as a character vector.
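
For example, with more than one result column it might look like the sketch below. The `Coronavirus PCR` column and the `pcr_cols` name are hypothetical, added only to show the pattern. Note also that `mutate_at()` has been superseded by `across()` since dplyr 1.0.0, although it still works.

```r
library(dplyr)

# Toy data; `Coronavirus PCR` is a hypothetical second result column
viral <- tibble(
  `Animal ID` = c("AATPH", "AATPL"),
  `Filovirus (MOD) PCR` = c("Indeterminant", "Negative"),
  `Coronavirus PCR` = c("Positive", "Negative")
)

# Every column to recode, listed explicitly by name
pcr_cols <- c("Filovirus (MOD) PCR", "Coronavirus PCR")

viralres <- viral %>%
  mutate_at(pcr_cols, ~ case_when(. == "Indeterminant" ~ "0",
                                  . == "Negative" ~ "0",
                                  . == "Positive" ~ "1"))
```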