
I have daily data with importer names (more than 100 thousand observations). Roughly 10,000 importers probably occur in the data set multiple times, but with slightly different spellings of their names.

Does anyone have experience with checking for similarity within the same variable and replacing each importer name with one unique name/code? What code should I use?

For example

library(tibble)  # for tibble(); a plain data.frame() would work too

importers <- tibble(V1 = c("3M",
  "3M Company",
  "3M Co",
  "A & R LOGISTICS INC",
  "AR LOGISTICS INC",
  "A & R LOGISTICS LTD",
  "ABB GROUP",
  "ABB LTD",
  "ABB INC"))

I want a new column V2 next to V1 containing a unique name/code for each group of similar names.

Ben Bolker
Rezoan
  • I'm very interested in the responses to this. You could try something with [clustering based on Levenshtein distances](https://stackoverflow.com/questions/57701273/clustering-in-r-levenshtein-distance) (or see [here](https://stackoverflow.com/questions/21511801/text-clustering-with-levenshtein-distances)). It could be hard to do in a completely automated way. [OpenRefine](https://openrefine.org/) is a useful tool for doing it in a human-assisted way (and building a set of rules that can be applied to new data automatically). – Ben Bolker Jul 22 '20 at 17:13
  • I checked your links; that code is quite difficult to follow for a novice like me. Can you suggest any simpler alternatives? – Rezoan Jul 22 '20 at 19:12

1 Answer


This may not be satisfactory, but it could give you a start. Most of the solution here is copied from this answer, with some additional explanation.

The hard part is that at some point you have to make a subjective decision about how close two labels should be in order to count as a cluster, and this will be hard to do for a large data set.

First we compute string distances:

s <- importers$V1  ## for brevity
d  <- adist(s)     ## compute Levenshtein distances
dimnames(d) <- list(s,s)
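For intuition about what adist() is measuring: it counts the single-character insertions, deletions, and substitutions needed to turn one string into another. A couple of spot checks on the sample names:

```r
adist("3M", "3M Co")          # 3: insert " Co"
adist("3M Company", "3M Co")  # 5: delete "mpany"
adist("3M Co", "ABB LTD")     # much larger; these belong to different clusters
```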

Visualize (obviously impractical for a huge set of names ...)

par(mar=c(6,1,1,6))
heatmap(d,Rowv=NA,Colv=NA,margins=c(12,12))

[heatmap of the pairwise string-distance matrix]

A human can easily tell that there are three clusters here. However, there's no easy cutoff in terms of string distance:

par(las=1)
plot(table(d),xlab="string distance",ylab="frequency")

[frequency plot of the string distances]

The within-cluster and between-cluster distance distributions overlap ...

Now we do hierarchical clustering:

hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=3)  ## select 3 clusters

[cluster dendrogram with rectangles marking the three clusters]

Once we decide that there are 3 clusters here, the clustering algorithm selects the "correct" elements for each cluster.

Create a new column giving the cluster identity for each row:

importers$code <- cutree(hc,k=3)
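To sanity-check the result on the toy data, you can split the names by their cluster code (the numeric labels themselves are arbitrary; only the grouping matters). A self-contained version:

```r
s <- c("3M", "3M Company", "3M Co",
       "A & R LOGISTICS INC", "AR LOGISTICS INC", "A & R LOGISTICS LTD",
       "ABB GROUP", "ABB LTD", "ABB INC")
hc   <- hclust(as.dist(adist(s)))  # same clustering as above
code <- cutree(hc, k = 3)
split(s, code)  # one list element per cluster, each holding one family of names
```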

As I suggested in comments, it might be better to do this job in OpenRefine: it could be hard to write a reliable, robust, completely automated method for doing this task.

Also: I don't know how badly this will scale to a data set of 10,000 names. Hierarchical clustering is fast, but the distance matrix will be huge (10,000 × 10,000 ≈ 100 million entries, about 50 million distinct pairs), which will take time to compute and space to store. (There are faster ways to compute Levenshtein distances than the built-in adist() ...)
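For example, the stringdist package (an external package — this is a sketch, not a benchmark) computes Levenshtein distances faster than adist() and returns a 'dist' object that hclust() accepts directly:

```r
## install.packages("stringdist")
library(stringdist)

s  <- c("3M", "3M Co", "ABB LTD")         # stand-in for unique(importers$V1)
d  <- stringdistmatrix(s, method = "lv")  # "lv" = plain Levenshtein distance
hc <- hclust(d)                           # d is already a 'dist' object
```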

A few suggestions for making the problem more computationally tractable (although it won't be easy in any case):

  • you definitely shouldn't try to do the clustering on the full data set. Instead, extract the vector of unique importer names, cluster them, then join (merge) them back with the full data set
  • you might be able to tackle the problem with this (inefficient) batch algorithm:
    • split the data into subsets of importers (it will probably work best if you alphabetize the vector first); cluster each of these, reduce them to the consensus names within each cluster
    • join the subsets and re-cluster
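The first bullet can be sketched in base R (with made-up object and column names — adjust to your data):

```r
## toy stand-in for the full 100k-row data set, with repeated names
full <- data.frame(V1 = c("3M", "3M Co", "3M", "ABB LTD", "ABB INC", "ABB LTD"))

u      <- unique(full$V1)            # cluster only the distinct names
hc     <- hclust(as.dist(adist(u)))
lookup <- data.frame(V1 = u, code = cutree(hc, k = 2))

full <- merge(full, lookup, by = "V1")  # join the codes back to every row
```

Note that merge() reorders rows; use dplyr::left_join() if you need to preserve the original row order.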
Ben Bolker
  • Thank you so much. When I run the code for 100 thousand observations, it says it is too large to compute. – Rezoan Jul 24 '20 at 19:56