I have a file with around 260 customers, but the same customer's name is often spelled in several different ways, as in the following example:

Cesar Fereira
Cesar Féreira   
César Fereira   
Cezar Fereira

Because of this, I have about 1000 distinct names. I would like a recommendation on how to correct the names in bulk for all customers, using an R package or some other approach.

Curious G.
  • 2
    Various string distances are used in such cases, but I am not sure there is a way to actually do the trick without the user checking the results. I would suggest something like the `stringdist` package. – NpT Oct 25 '19 at 13:13
  • 2
    [Relevant](https://stackoverflow.com/questions/6044112/how-to-measure-similarity-between-strings) – Sotos Oct 25 '19 at 13:16

1 Answer

If you are dealing not only with accents but also with alternative letters, `agrep` might be a solution.

d <- c("Cesar Fereira", "Cesar Féreira", "César Fereira ", "Cezar Fereira")
# For each name, return every element of d that approximately matches it
lapply(d, function(x) agrep(x, d, max.distance = 0.1, ignore.case = TRUE, value = TRUE))

EDIT: Expanding on Parfait's proposal, you could compare every pair of names:

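library(dplyr)
d <- c("Cesar Fereira", "Cesar Féreira", "César Fereira ", "Cezar Fereira", "Zebra", "Zébra")
# agrepl() expects a single pattern string, so apply it pair by pair with mapply()
expand.grid(Var1 = d, Var2 = d, stringsAsFactors = FALSE) %>%
  mutate(same = mapply(agrepl, Var1, Var2, MoreArgs = list(max.distance = 7), USE.NAMES = FALSE))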

Playing around with `max.distance`, the selectivity does not seem to be very good, as you can see. Bummer.
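One possible way to get more control over selectivity, offered only as a rough sketch: compute the full edit-distance matrix with base R's `adist()` and group the names with `hclust()` and `cutree()`. The cut height `h = 2` and the rule "keep the first spelling in each group" are assumptions for illustration, not something tested on the real 1000 names, and the resulting groups would still need a manual check.

```r
# Names from the example above, plus two unrelated ones for contrast
d <- c("Cesar Fereira", "Cesar Féreira", "César Fereira ", "Cezar Fereira",
       "Zebra", "Zébra")

# Trim stray whitespace before comparing
clean <- trimws(d)

# Pairwise generalized Levenshtein distances (utils::adist, base R)
dist_mat <- adist(clean, ignore.case = TRUE)

# Single-linkage clustering; h = 2 is an assumed threshold that has to be
# tuned on real data
groups <- cutree(hclust(as.dist(dist_mat), method = "single"), h = 2)

# Assumption: use the first spelling seen in each group as the corrected name
corrected <- ave(clean, groups, FUN = function(x) x[1])

data.frame(original = d, group = groups, corrected = corrected)
```

A larger `h` merges more spellings but also raises the risk of merging two different customers, so the threshold is a trade-off rather than a fixed value.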

Janhoo
  • 1
    You can even use [`agrepl`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/agrep) inside `ifelse` for OP's need of *correction*: `clean_d <- ifelse(agrepl("Cesar Fereira", d, max.distance = 0.1, ignore.case = TRUE), "Cesar Fereira", d)` – Parfait Oct 25 '19 at 13:46
  • @Janhoo, I wanted a recommendation and I found a solution, thanks. =D – Curious G. Oct 25 '19 at 15:13
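
Parfait's `ifelse` suggestion handles one customer at a time. A minimal sketch of extending it to several customers, assuming a reference list of correct spellings is available (the `canonical` vector here is hypothetical and would have to come from your own data):

```r
d <- c("Cesar Fereira", "Cesar Féreira", "César Fereira ", "Cezar Fereira",
       "Zebra", "Zébra")

# Hypothetical reference list of correct customer names
canonical <- c("Cesar Fereira", "Zebra")

# Start from the trimmed input and overwrite every approximate match
clean_d <- trimws(d)
for (name in canonical) {
  hit <- agrepl(name, clean_d, max.distance = 0.1, ignore.case = TRUE)
  clean_d[hit] <- name
}

clean_d
```

Names that match none of the reference spellings are left unchanged, which makes them easy to review by hand afterwards.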