Accent insensitive regex in R

Question

I'm trying to use filter(grepl()) to match some words in my column. Suppose I want to extract the word "Guartelá". My column contains variations such as "guartela", "guartelá" and "Guartela". To match regardless of upper/lowercase I'm using (?i). However, I haven't found a good way to match regardless of accents (i.e., both "guartelá" and "guartela").

I know that I can simply replace á with a, but is there a way to make the match itself accent-insensitive in the code? It can be base R/tidyverse/any, I don't mind.

Here's my current code:

cobras <- final %>% filter(grepl("(?i)guartelá", NAME) 
                           | grepl("(?i)guartelá", locality))

Cheers

You can remove all accented characters before you try to match. See https://stackoverflow.com/a/56595128/2372064. But I don't think regular expressions themselves have classes identifying synonymous accented characters. You'd have to define the classes yourself. — MrFlick, Nov 21 '22 at 15:59

score 4 · Answer 1 · answered Nov 21 '22 at 16:14

You can use stri_trans_general from stringi to remove all accents:

unaccent_chars <- stringi::stri_trans_general(c("guartelá", "with_é", "with_â", "with_ô"), "Latin-ASCII")
unaccent_chars
# [1] "guartela" "with_e"   "with_a"   "with_o" 
# grepl(paste(unaccent_chars,collapse = "|"), string)
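For the filtering case in the question, one way to apply this is to strip accents from both the pattern and the column before matching. This is only a minimal sketch: the `final` data frame below is made-up sample data standing in for the asker's, and `strip_accents` is a hypothetical helper name, not something stringi provides.

```r
library(stringi)

# Made-up sample data standing in for the asker's `final` data frame.
final <- data.frame(
  NAME     = c("Guartelá", "guartela", "Other", "other"),
  locality = c("x", "y", "Parque GUARTELÁ", "z"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: transliterate accented Latin characters to ASCII.
strip_accents <- function(x) stri_trans_general(x, "Latin-ASCII")

# Normalise the search term once, then normalise each column before matching.
pattern <- strip_accents("guartelá")

keep <- grepl(pattern, strip_accents(final$NAME), ignore.case = TRUE) |
  grepl(pattern, strip_accents(final$locality), ignore.case = TRUE)

# Rows where either column contains the term, ignoring case and accents.
cobras <- final[keep, ]
cobras
```

The original columns are left untouched; only the copies passed to grepl() are transliterated.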

score 2 · Accepted Answer · answered Nov 21 '22 at 15:58

You can use a bracketed character class ([...]) to match any one of several characters at a given position, which covers the different combinations:

> string <- c("Guartelá", "Guartela", "guartela", "guartelá", "any")
> grepl("[Gg]uartel[aá]", string)
[1]  TRUE  TRUE  TRUE  TRUE FALSE
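If you have many words, writing the classes by hand gets tedious. Here is a rough base R sketch that builds them from a plain word. The `accent_classes` table and the `make_accent_pattern` function are hypothetical names made up for this example, and the table only covers a few Portuguese-style accents, so extend it as needed. It assumes a UTF-8 session.

```r
# Hypothetical lookup: plain letter -> character class with accented variants.
accent_classes <- c(
  a = "[aáàâã]", e = "[eéèê]", i = "[iíì]",
  o = "[oóòôõ]", u = "[uúùü]", c = "[cç]"
)

# Hypothetical helper: swap each listed letter for its character class.
make_accent_pattern <- function(word) {
  chars <- strsplit(word, "")[[1]]
  hits <- chars %in% names(accent_classes)
  chars[hits] <- unname(accent_classes[chars[hits]])
  paste(chars, collapse = "")
}

string <- c("Guartelá", "Guartela", "guartela", "guartelá", "any")
pattern <- make_accent_pattern("guartela")

# ignore.case = TRUE takes care of the G/g difference.
result <- grepl(pattern, string, ignore.case = TRUE)
result
```

The word passed in should be the unaccented form, since only plain letters are looked up in the table.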

Jilber Urbina

Damn, I didn't know I could use [] for this...pretty amazing – hiperhiper Nov 21 '22 at 16:01

score 1 · Answer 3 · answered Nov 21 '22 at 15:59

Another option using str_detect():

library(tidyverse)
tibble(name = c("guartela","guartelá", "Guartela", "Other")) |> 
  filter(str_detect(name, "guartela|guartelá|Guartela"))
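A different technique from the regex alternation above, in case listing every variant is not practical: stringi can do collation-based matching, and at collator strength 1 it compares base letters only, so case and accents are both ignored. This is a sketch with made-up sample data; note that it matches a fixed string rather than a regular expression.

```r
library(dplyr)
library(stringi)

# Made-up sample data.
places <- tibble(name = c("guartela", "guartelá", "Guartela", "Guartelá", "Other"))

# strength = 1 is passed on to the collator options: primary differences only,
# so "a"/"á" and "g"/"G" are treated as equal.
matches <- places |>
  filter(stri_detect_coll(name, "guartela", strength = 1))
matches
```

This keeps the accent handling inside the matching call itself, which is closer to what the question asked for than pre-processing the column.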

Julian