Identifying unique values is straightforward when the data is well behaved. Here I am looking for an approach to get a list of approximately unique values from a character vector.
Let x be a vector with slightly different names for an entity, e.g. "Kentucky loader" may appear as "Kentucky load" or "Kentucky loader (additional info)" or something similar.
x <- c("Kentucky load",
"Kentucky loader (additional info)",
"CarPark Gifhorn (EAP)",
"Car Park Gifhorn (EAP) new 1.5.2012",
"Center Kassel (neu 01.01.2014)",
"HLLS Bremen (EAP)",
"HLLS Bremen (EAP) new 06.2013",
"Hamburg total sum (abc + TBL)",
"Hamburg total (abc + TBL) new 2012")
What I want to get out is something like:
c("Kentucky loader",
"Car Park Gifhorn (EAP)",
"Center Kassel (neu 01.01.2014)",
"HLLS Bremen (EAP)",
"Hamburg total (abc + TBL)")
Idea
- Calculate some similarity measure between all strings (e.g. Levenshtein distance)
- Use a longest common substring (or subsequence) method
- Somehow :( decide which strings belong together based on this information.
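A minimal sketch of the idea above, to make the question concrete. It assumes base R only: adist() from utils for the Levenshtein distance and hclust()/cutree() from stats for the grouping. The cut at k = 5 is an assumption taken from the desired output above, and picking the shortest member as the label is an arbitrary choice, so this is not meant as a finished solution. It will not necessarily reproduce the desired output, and the grouping is probably sensitive to both the distance measure and the cut-off.

```r
# Sketch only: group near-duplicate strings via edit distance + clustering.
x <- c("Kentucky load",
       "Kentucky loader (additional info)",
       "CarPark Gifhorn (EAP)",
       "Car Park Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")

# Pairwise generalized Levenshtein distance (adist ships with base R).
d <- adist(x)

# Scale each distance by the longer string of the pair, so values lie in [0, 1].
len <- nchar(x)
d_norm <- d / outer(len, len, pmax)

# Hierarchical clustering on the normalized distances.
hc <- hclust(as.dist(d_norm), method = "average")

# ASSUMPTION: cut into k = 5 groups, the number of entities in the desired output.
groups <- cutree(hc, k = 5)

# Strings that ended up in the same group.
split(x, groups)

# ASSUMPTION: label each group by its shortest member (arbitrary choice).
representatives <- vapply(split(x, groups),
                          function(s) s[which.min(nchar(s))],
                          character(1))
representatives
```

In practice the number of groups is not known in advance, so cutree(hc, h = ...) with a distance threshold would probably be needed instead of a fixed k, which is exactly the "somehow decide" step I am unsure about.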
I guess this is a standard task (for R users who regularly work with "dirty" data), so I assume there is a set of standard approaches to it.
Does anyone have a hint, or is there a package that does this?