0

I have the following data frame called result.

MANUAL.ID AUTO.ID loc
NA PYPPYP L2
PIPpip NA L1
Barbar NA L5
NA Pippip L3
NA Pippip,BerBar L3

I try to replace all the different forms of a same tag by the right one. For example replace all PIPPIP and PIPpip by Pippip or Berbar by Barbar. To do this, I use a mutate function with case_when based on a required file called tesaurus which have column with all the possible case of a same tag (tag_id) and a column with the correct one (tag_ok) which looks like this :

tag_id tag_ok
PYPPYP Pippip
PIPpip Pippip
Pippip Pippip
Barbar Barbar

I use the following code.

library("plyr")
library("dplyr")
library("reshape")
library("data.table")
library("stringr")
library("tidyr")
library("openxlsx")

tesaurus <- read.xlsx("Requested_files/sp_tesaurus.xlsx", sheet = "tesaurus") %>%
  select(-bat_sp)

result <- result %>%
  mutate_at(
    vars(contains("ID")),
    list(as.formula(
      paste0(
        "~ case_when(",
        paste(
          sprintf(
            "str_detect(., pattern = regex('%s', ignore_case = TRUE)) ~ '%s'",
            tesaurus$tag_id,
            tesaurus$tag_ok
          ),
          collapse = ", "
        ), ", TRUE ~ .)"
      )
    ))
  )

My problem is that when I run it I have this error message :

Error in `mutate()`:
ℹ In argument: `AUTO.ID = (structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
Caused by error in `case_when()`:
! Can't combine `..1 (right)` <character> and `..161 (right)` <double>.
---
Backtrace:
     ▆
  1. ├─result %>% ...
  2. ├─dplyr::mutate_at(...)
  3. │ ├─dplyr::mutate(.tbl, !!!funs)
  4. │ └─dplyr:::mutate.data.frame(.tbl, !!!funs)
  5. │   └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  8. │       └─mask$eval_all_mutate(quo)
  9. │         └─dplyr (local) eval()
 10. ├─`<inln_cl_>`(AUTO.ID)
 11. │ └─rlang::eval_bare(`_quo`, base::parent.frame())
 12. ├─rlang (local) case_when(...)
 13. └─dplyr::case_when(...)

So, what is the problem and how can I fix it ?

lobarth
  • 43
  • 6
  • tolower or toupper wouldn't work in this case? Related post https://stackoverflow.com/q/6364783/680068 – zx8754 Apr 14 '23 at 11:46
  • No because it's not just about capital letter. Ther is some case with other problems like Pyppyp instead of the right one Pippip.@zx8754 – lobarth Apr 14 '23 at 11:52

1 Answers1

1

I think a join/coalesce method will work here. Updated with your new data (i.e., "PYPPYP" to "Pippip"):

result %>%
  left_join(tesaurus, by = c(MANUAL.ID = "tag_id")) %>%
  mutate(MANUAL.ID = coalesce(tag_ok, MANUAL.ID)) %>%
  select(-tag_ok) %>%
  left_join(tesaurus, by = c(AUTO.ID = "tag_id")) %>%
  mutate(AUTO.ID = coalesce(tag_ok, AUTO.ID)) %>%
  select(-tag_ok)
#   MANUAL.ID AUTO.ID loc
# 1      <NA>  Pippip  L2
# 2    Pippip    <NA>  L1
# 3    Barbar    <NA>  L5
# 4      <NA>  Pippip  L3

or we can use match (though I think it's not better than above):

result %>%
  mutate(across(c(MANUAL.ID, AUTO.ID),
    ~ coalesce(tesaurus$tag_ok[match(., tesaurus$tag_id)], .)))
#   MANUAL.ID AUTO.ID loc
# 1      <NA>  Pippip  L2
# 2    Pippip    <NA>  L1
# 3    Barbar    <NA>  L5
# 4      <NA>  Pippip  L3

Your added example with "Pippip,BerBar" is going to do nothing since you did not include BerBar in the tesaurus lookup dictionary. However, if we add it:

tesaurus <- bind_rows(tesaurus, data.frame(tag_id="BerBar", tag_ok="Barbar"))
tesaurus
#   tag_id tag_ok
# 1 PYPPYP Pippip
# 2 PIPpip Pippip
# 3 Pippip Pippip
# 4 Barbar Barbar
# 5 BerBar Barbar

we can use strsplit and the match lookup as follows:

result %>%
  mutate(across(c(MANUAL.ID, AUTO.ID),
     ~ sapply(strsplit(., "[, ]+"), function(st) paste(coalesce(tesaurus$tag_ok[match(st, tesaurus$tag_id)], st), collapse = ","))))
#   MANUAL.ID       AUTO.ID loc
# 1        NA        Pippip  L2
# 2    Pippip            NA  L1
# 3    Barbar            NA  L5
# 4        NA        Pippip  L3
# 5        NA Pippip,Barbar  L3
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • this code works well for the MANUAL.ID column but for the AUTO0ID column because of the special case : Pippip,BerBar (there is two tad in one case) – lobarth Apr 14 '23 at 14:26
  • @barthdufau See my edit, I think it resolves your problem. But in the future please try to avoid hidden features like that when it is clearly a factor in the methodology. – r2evans Apr 14 '23 at 14:32
  • Thanks ! Does I just need to executed the new line after the previous code ? and remove the # ? – lobarth Apr 14 '23 at 14:41
  • 1
    oops, edited @barthdufau – r2evans Apr 14 '23 at 14:51