I'm trying to use filter(grepl()) to match some words in my column. Suppose I want to extract the word "Guartelá". My column contains variations such as "guartela", "guartelá", and "Guartela". To match both upper- and lowercase I'm using (?i). However, I haven't found a good way to match both accented and unaccented forms (i.e., "guartelá" and "guartela").

I know that I can simply replace á with a, but is there a way to make the match accent-insensitive in the code itself? Base R, tidyverse, or anything else is fine.

Here's my current code:

cobras <- final %>% filter(grepl("(?i)guartelá", NAME) 
                           | grepl("(?i)guartelá", locality))

Cheers

hiperhiper
    You can remove all accented characters before you try to match; see https://stackoverflow.com/a/56595128/2372064. But I don't think regular expressions themselves have classes identifying equivalent accented characters. You'd have to define the classes yourself. – MrFlick Nov 21 '22 at 15:59

3 Answers

You can use stri_trans_general() from stringi to remove all accents:

unaccent_chars <- stringi::stri_trans_general(c("guartelá", "with_é", "with_â", "with_ô"), "Latin-ASCII")
unaccent_chars
# [1] "guartela" "with_e"   "with_a"   "with_o"
# grepl(paste(unaccent_chars, collapse = "|"), string)
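
For the match to be accent-insensitive, the data being searched presumably needs the same treatment as the pattern. A minimal sketch, assuming a hypothetical vector `name` standing in for the NAME column from the question:

```r
library(stringi)

# Hypothetical sample data standing in for the NAME column
name <- c("Guartelá", "Guartela", "guartela", "guartelá", "any")

# Transliterate the data to ASCII so that "á" becomes "a"
name_ascii <- stri_trans_general(name, "Latin-ASCII")

# Match the unaccented pattern, ignoring case
hits <- grepl("guartela", name_ascii, ignore.case = TRUE)
hits
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```

Inside filter(), the same idea would be to wrap the column in stri_trans_general() before passing it to grepl(), which leaves the stored values untouched.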
islem

You can use character classes ([...]) to list the accepted alternatives at each position:

> string <- c("Guartelá", "Guartela", "guartela", "guartelá", "any")
> grepl("[Gg]uartel[aá]", string)
[1]  TRUE  TRUE  TRUE  TRUE FALSE
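
Applied to the two columns from the question, this might look like the sketch below. The data frame `final` here is made-up sample data, and ignore.case = TRUE is used in place of [Gg] so that capitalisation anywhere in the word is accepted; a UTF-8 session is assumed.

```r
# Made-up sample data with the two columns named in the question
final <- data.frame(
  NAME = c("Guartelá", "other", "x", "y"),
  locality = c("a", "guartela", "GUARTELA park", "none"),
  stringsAsFactors = FALSE
)

# The character class covers both the accented and the unaccented "a"
pattern <- "guartel[aá]"

# Keep a row if either column matches
keep <- grepl(pattern, final$NAME, ignore.case = TRUE) |
  grepl(pattern, final$locality, ignore.case = TRUE)

cobras <- final[keep, ]
cobras
```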
Jilber Urbina

Another option using str_detect():

library(tidyverse)
tibble(name = c("guartela","guartelá", "Guartela", "Other")) |> 
  filter(str_detect(name, "guartela|guartelá|Guartela"))
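
The three alternatives can probably be collapsed into one pattern by combining a character class for the accent with stringr's regex() modifier for case. A minimal sketch on the same sample values, shown on a plain character vector rather than a tibble:

```r
library(stringr)

# Same sample values as in the tibble above
name <- c("guartela", "guartelá", "Guartela", "Other")

# [aá] handles the accent; ignore_case = TRUE handles capitalisation
hits <- str_detect(name, regex("guartel[aá]", ignore_case = TRUE))
hits
# [1]  TRUE  TRUE  TRUE FALSE
```

The same str_detect() call can be dropped into filter() in place of the longer alternation.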
Julian