
Identifying unique values is straightforward when the data is well behaved. Here I am looking for an approach to extract a list of approximately unique values from a character vector.

Let x be a vector containing slightly different names for the same entity; e.g. "Kentucky loader" may appear as "Kentucky load", "Kentucky loader (additional info)", or something similar.

x <- c("Kentucky load" ,                                                                                                            
       "Kentucky loader (additional info)",                                                                                     
       "CarPark Gifhorn (EAP)",
       "Car Park  Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")

What I want to get out is something like:

c("Kentucky loader" ,                                                                                                            
  "Car Park Gifhorn (EAP)",
  "Center Kassel (neu 01.01.2014)",
  "HLLS Bremen (EAP)",
  "Hamburg total (abc + TBL)")

Idea

  1. Calculate some similarity measure between all strings (e.g. Levenshtein distance)
  2. Use the longest common substring method
  3. Somehow :( decide which strings belong together based on this information
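To make the idea concrete, here is a minimal base-R sketch of steps 1 and 3, using adist() for the Levenshtein distances and hierarchical clustering for the grouping. The cutoff h = 15 is an illustrative guess, not a principled choice:

```r
x <- c("Kentucky load",
       "Kentucky loader (additional info)",
       "CarPark Gifhorn (EAP)",
       "Car Park  Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")

d   <- adist(x)             # pairwise Levenshtein distances (step 1)
hc  <- hclust(as.dist(d))   # cluster similar strings together (step 3)
grp <- cutree(hc, h = 15)   # cut the tree at an edit-distance threshold
split(x, grp)               # inspect the resulting groups
```

Note that an absolute edit-distance threshold is fragile when a short name and a much longer variant belong together (as with the two Kentucky strings), so the cutoff and the linkage method would need tuning; the stringdist package offers more distance measures to experiment with.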

But I guess this is a standard task (for those R users who work with "dirty" data regularly), so I assume there is a set of standard approaches to it.

Does anyone have a hint, or is there a package that does this?

Mark Heckmann
    You have basically outlined the "standard" approach. There is no magic bullet for cleaning text data like this. You do the best you can with the tools you already seem to know about, but then you'll have to manually review it. – joran Dec 28 '15 at 20:25
  • Well, if I actually knew how, I would not ask ;) – Mark Heckmann Dec 28 '15 at 20:34
  • 3
    Not R, but [OpenRefine](http://openrefine.org/) is specifically developed for such a task. – Jaap Dec 28 '15 at 20:51

1 Answer


As @Jaap said, try playing with OpenRefine. The Data Carpentry course on it is pretty good.

If you do want to stay in R, here's a solution for your example, using agrepl:

# logical matrix: z[i, j] is TRUE if pattern x[j] approximately matches x[i]
z <- sapply(x, function(pat) agrepl(pat, x, max.distance = 0.2))
# for each string, keep the shortest of its approximate matches
apply(z, 1, function(hits) x[hits][which.min(nchar(x[hits]))])

This gives the shortest matching string for each member of x:

[1] "Kentucky load"                  "Kentucky load"                  "CarPark Gifhorn (EAP)"         
[4] "CarPark Gifhorn (EAP)"          "Center Kassel (neu 01.01.2014)" "HLLS Bremen (EAP)"             
[7] "HLLS Bremen (EAP)"              "Hamburg total sum (abc + TBL)"  "Hamburg total sum (abc + TBL)" 

This is useful if you want to keep the order of your vector so it matches others (or to use it on a column of a data frame).

You can call unique on this output to get your desired output.
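Putting the whole pipeline together, for example:

```r
x <- c("Kentucky load",
       "Kentucky loader (additional info)",
       "CarPark Gifhorn (EAP)",
       "Car Park  Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")

# approximate match matrix, then shortest match per element
z <- sapply(x, function(pat) agrepl(pat, x, max.distance = 0.2))
shortest <- apply(z, 1, function(hits) x[hits][which.min(nchar(x[hits]))])
unique(shortest)
# [1] "Kentucky load"                  "CarPark Gifhorn (EAP)"
# [3] "Center Kassel (neu 01.01.2014)" "HLLS Bremen (EAP)"
# [5] "Hamburg total sum (abc + TBL)"
```

Note that the representative for each group is the shortest variant, so you get "Kentucky load" rather than the "Kentucky loader" from the desired output; a post-processing pass would be needed to pick a different canonical form.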

jeremycg