0

I want to group_by similar results (not unique) and I don't know how to do it.

I mean, I have a df with a column called 'name' that has similar results like: ARPO, ARPO S.L, ARPO, SL, etc.

|---------------------|------------------|
|      name           |     address      |
|---------------------|------------------|
|       ARPO          |     street 1     |
|---------------------|------------------|
|       ARPO S.L      |     street 1     |
|---------------------|------------------|
|       ARPO, SL      |     street 1     |
|---------------------|------------------|
|       ARPO SL       |     street 1     |
|---------------------|------------------|
|       AAAA          |     street 2     |
|---------------------|------------------|
|       AAAAAb        |     street 2     |
|---------------------|------------------|
|       AAAAAB        |     street 2     |
|---------------------|------------------|

The idea is to establish a threshold like 0,8 (or similar) to identify results that have an 80% of coincidence.

Then groupping them by 'similar_names' with dplyr library to keep only one result (row) of each group.

library (dplyr)
groups <- df %>%
  group_by(similar_names) %>%
  summarise() %>%
  arrange(name)

I tried different options with different libraries like: stringr, duplicated, adist, etc... by I didn't find a good solution.

  • Welcome to stackoverflow! Your question is unclear, please read and edit your question according to [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that other users can help you. Also, add expected output. – pogibas Dec 13 '19 at 13:02

2 Answers2

2

Here I have an example to input:

df <- tibble::tribble(
  ~name,       ~number,       ~ind,
  "ARPO",      "405162",      5,
  "ARPO S.L.", "504653",      22,
  "ARPOS",     "900232",      1,
  "ARPO",      "504694",      12,
  "ARPO",      "400304",      42,
  "JJJJ",      "401605",      2,
  "JJJJ",      "900029",      31,
  "BBBBB",     "400090",      25,
  "BBBBB",     "403004",      33,
  "JJJJ",      "900222",      2,
  "BBBBB",     "403967",      11,
  "BBBB",      "400304",      52,
  "JJJJ",      "404308",      200,
  "ARPO",      "403898",      2,
  "ARPO",      "158159",      24,
  "BBBBBBB",   "700805",      2,
  "ARPO S.L.", "900245",      24,
  "JJJJ",      "501486",      2,
  "JJJJ",      "400215",      210,
  "JJJJ",      "504379",      26,
  "HARPO",     "900222",      400,
  "BBBBB",     "109700",      46,
  "ARPO",      "142173",      14,
  "BBBBB",     "400586",      22,
  "ARPO",      "401605",      322
)

I found a similar solution here: Group together levels with similar names R

x <- df$name

groups <- list()
i <- 1
while(length(x) > 0) {

  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
  groups[[i]] <- x[id]
  x <- x[-id]
  i <- i + 1

}

So, from that point, you can create a group variable:

df$group <- ""

for (j in 1:length(groups)){
  df$group <- ifelse(df$name %in% groups[[j]], paste0("group_",j), df$group)
}

Maybe you can find a simpler solution, but this works!

NoeliaNC
  • 41
  • 3
  • Thanks Noelia! 'max.distance' parameter is the threshold? I'm not too much certain how groups, after running the loop, have been created. With a max.distance of 0.1 there are 1340 groups. – kikusanchez Dec 13 '19 at 14:06
1

The function below uses agrepl to get similar strings, given a threshold thresh. And returns an integer vector of positions where the first in a group was found.

The test data is the data in NoeliaNC's answer.

library(dplyr)

similarGroups <- function(x, thresh = 0.8){
  grp <- integer(length(x))
  name <- x
  for(i in seq_along(name)){
    if(!is.na(name[i])){
      sim <- agrepl(x[i], x, ignore.case = TRUE, max.distance = 1 - thresh)
      k <- which(sim & !is.na(name))
      grp[k] <- i
      is.na(name) <- k
    }
  }
  grp
}

similarGroups(df[['name']])
# [1] 1 1 1 1 1 6 6 8 8 6 8 8 6 1 1 8 1 6 6 6 1 8 1 8 1

Now apply the function to grouping the dataframe.

df %>%
  mutate(group = name[similarGroups(name)]) %>%
  count(group)
## A tibble: 3 x 2
#  group     n
#  <chr> <int>
#1 ARPO     11
#2 BBBBB     7
#3 JJJJ      7

Edit

Another way is to use the stringsim function in package stringdist. It features several distance/similarity measures, that can be tested to see which one gives better results.

similarGroups2 <- function(x, thresh = 0.8, method = "soundex"){
  grp <- integer(length(x))
  name <- x
  x <- tolower(x)
  for(i in seq_along(name)){
    if(!is.na(name[i])){
      sim <- stringdist::stringsim(x[i], x, method = method)
      k <- which(sim > thresh & !is.na(name))
      grp[k] <- i
      is.na(name) <- k
    }
  }
  grp
}

df %>%
   mutate(group = name[similarGroups2(name, thresh = 0.7, method = "jw")]) %>%
   count(group)
## A tibble: 4 x 2
#  group             n
#  <chr>         <int>
#1 Antonio Gomez     3
#2 ARPO             11
#3 BBBBB             7
#4 JJJJ              7

New data

df <- tibble::tribble(
  ~name,       ~number,       ~ind,
  'Antonio Gomez', 1234,       1,
  'Antonio Sanches', 5678,     2,
  'Antonio Ruiz',  9089,       3,
  "ARPO",      "405162",      5,
  "ARPO S.L.", "504653",      22,
  "ARPOS",     "900232",      1,
  "ARPO",      "504694",      12,
  "ARPO",      "400304",      42,
  "JJJJ",      "401605",      2,
  "JJJJ",      "900029",      31,
  "BBBBB",     "400090",      25,
  "BBBBB",     "403004",      33,
  "JJJJ",      "900222",      2,
  "BBBBB",     "403967",      11,
  "BBBB",      "400304",      52,
  "JJJJ",      "404308",      200,
  "ARPO",      "403898",      2,
  "ARPO",      "158159",      24,
  "BBBBBBB",   "700805",      2,
  "ARPO S.L.", "900245",      24,
  "JJJJ",      "501486",      2,
  "JJJJ",      "400215",      210,
  "JJJJ",      "504379",      26,
  "HARPO",     "900222",      400,
  "BBBBB",     "109700",      46,
  "ARPO",      "142173",      14,
  "BBBBB",     "400586",      22,
  "ARPO",      "401605",      322
)
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Thank you very much Rui! Now, if I want to keep only one result of each group, the idea is continue using agrepl in order to remove all results but one of each group, right? Can I ask you what would be your approach? Thanks! – kikusanchez Dec 16 '19 at 08:20
  • In this case, the function fails because it groups by 100% concidence people with the same first name: antonio, antonio sanchez, antonio ruiz, antonio gomez (names are unreal because rights protection). How you would manage with this? – kikusanchez Dec 16 '19 at 12:21
  • Rui, in your edit you call the function SimilarGroups again or is a typing error and you meant SimilarGroups2? df %>% mutate(group = name[similarGroups(name, thresh = 0.7, method = "jw")]) %>% count(group) – kikusanchez Dec 16 '19 at 14:29
  • @kikusanchez You are right, sorry. Will edit, it's `similarGroups2`. – Rui Barradas Dec 16 '19 at 16:51