I want to group rows by similar (not identical) values, and I don't know how to do it.
I have a data frame with a column called 'name' that contains near-duplicate values such as "ARPO", "ARPO S.L", "ARPO, SL", etc.
|---------------------|------------------|
| name | address |
|---------------------|------------------|
| ARPO | street 1 |
|---------------------|------------------|
| ARPO S.L | street 1 |
|---------------------|------------------|
| ARPO, SL | street 1 |
|---------------------|------------------|
| ARPO SL | street 1 |
|---------------------|------------------|
| AAAA | street 2 |
|---------------------|------------------|
| AAAAAb | street 2 |
|---------------------|------------------|
| AAAAAB | street 2 |
|---------------------|------------------|
The idea is to set a similarity threshold such as 0.8, so that names that match by at least 80% are treated as the same name.
Then I would group them by a 'similar_names' column with the dplyr library and keep only one row per group.
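library(dplyr)
groups <- df %>%
  group_by(similar_names) %>%  # similar_names is the grouping column I don't know how to build
  slice(1) %>%                 # keep only the first row of each group
  ungroup() %>%
  arrange(name)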
I tried different options, such as the stringr package and the base R functions duplicated() and adist(), but I didn't find a good solution.
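
A minimal sketch of the thresholding idea, under several assumptions that are not part of the original attempt: names are normalised by upper-casing and stripping punctuation and spaces, similarity is defined as 1 minus the Levenshtein distance (base R `adist()`) divided by the length of the longer name, and groups are formed by single-linkage clustering with `hclust()` and `cutree()`. The helper name `group_similar` is hypothetical, and other similarity measures or clustering methods may suit real data better.

```r
library(dplyr)

# Sample data copied from the table in the question
df <- data.frame(
  name    = c("ARPO", "ARPO S.L", "ARPO, SL", "ARPO SL",
              "AAAA", "AAAAAb", "AAAAAB"),
  address = c(rep("street 1", 4), rep("street 2", 3)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one integer group id per name.
# Assumes every name still has at least one letter or digit after cleaning.
group_similar <- function(names, threshold = 0.8) {
  # Assumed normalisation: upper-case, keep only letters and digits
  clean <- gsub("[^A-Z0-9]", "", toupper(names))

  # Pairwise Levenshtein distance, scaled by the longer of the two strings
  d   <- adist(clean)
  len <- nchar(clean)
  sim <- 1 - d / outer(len, len, pmax)

  # Single linkage: a name joins a group if it is similar enough
  # to at least one member of that group
  hc <- hclust(as.dist(1 - sim), method = "single")
  cutree(hc, h = 1 - threshold)
}

# With this measure, "ARPO" vs "ARPOSL" scores 1 - 2/6, about 0.67,
# so a cutoff of 0.6 is used here instead of 0.8
df$similar_names <- group_similar(df$name, threshold = 0.6)

groups <- df %>%
  group_by(similar_names) %>%
  slice(1) %>%   # keep the first row of each group
  ungroup() %>%
  arrange(name)
```

On the sample data, a cutoff of 0.6 collapses the rows into two groups, one per address. A cutoff of 0.8 with this particular measure keeps "ARPO" and "AAAA" apart from their longer variants, so the threshold has to be tuned to the similarity definition that is chosen.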