-1

My data includes a Name column. Some names are written in upto eight different ways. I tried grouping them with the following code:

groups <- list()
i <- 1
while(length(x) > 0)
{
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
  groups[[i]] <- x[id]
  x <- x[-id]
  i <- i + 1
}

head(groups)
groups

Next, I want to add a new column that returns the, for example, most commonly used notation of a name for each row. The result should look like:

      A            B
1. John Snow    John Snow
2. Peter Wright Peter Wright
3. john snow    John Snow
4. John snow    John Snow
5. Peter wright Peter Wright
6. J. Snow      John Snow
7. John Snow    John Snow
etc.

How can I get there?

Sotos
  • 51,121
  • 6
  • 32
  • 66

1 Answers1

2

This answer is heavily based on a previous question/answer which put strings into groups. This answer merely adds finding the mode for each group and assigning the right mode to the original strings.

## The data
Names = c("John Snow", "Peter Wright",  "john snow",
    "John snow", "Peter wright", "J. Snow", "John Snow")

## Grouping like in the previous question
groups <- list()
i <- 1
x = Names
while(length(x) > 0)
{
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.25)
  groups[[i]] <- x[id]
  x <- x[-id]
  i <- i + 1
}

## Find the mode for each group
Modes = sapply(groups, function(x) names(which.max(table(x))))

## Assign the correct mode to each string
StandardName = rep("", length(Names))
for(i in seq_along(groups)) {
    StandardName[Names %in% groups[[i]]] = Modes[i]
}

StandardName
[1] "John Snow"    "Peter wright" "John Snow"    "John Snow"    "Peter wright"
[6] "John Snow"    "John Snow" 

You will likely need to experiment with the right value of the max.distance argument to agrep.

If you want to add the answer to the data.frame, just add

df$StandardName = StandardName

To write the result so that it is accessible from Excel, use

write.csv(df, "MyData.csv")
G5W
  • 36,531
  • 10
  • 47
  • 80
  • Thanks for your reply! Now, for the final step I would like to know how I can add the results to the dataframe and write it back to Excel (I'm new to R). – E. van Dongen Nov 18 '19 at 15:02
  • @ G5W, again thanks for your reply. What happens now is that the answer is only added about 20% of the time. I think this is caused by the name variable being a factor. Althoughy I have 7.000 rows, there are only ~1.900 factors, as a lot of notations are used multiple times. Is there a way to work around? – E. van Dongen Nov 20 '19 at 10:02
  • If the name variable is a factor, I would recommend changing it to a string before processing. `Names = as.character(Names)` – G5W Nov 20 '19 at 16:01