I'm new here in more than one sense: this is my first post, about my first script, in my first attempt at learning any programming language. In light of that you might find this project overly ambitious, but hey, learning by doing has always been the way to go. I'm doing my best to follow Stack Overflow etiquette here, but let me know if I'm in violation of anything.
I wanted to write a piece of code that applies some kind of fuzzy logic to match a table of unstructured company names (e.g. Google) against a table of structured company names (e.g. Google Inc.) and Danish company identifiers (CVR).
I was able to find some bits of code by googling around, and I managed to adapt them to my project. I found that the Jaro-Winkler algorithm contained within the stringdist package works particularly well with company names. The script works perfectly fine when comparing and matching 40 unstructured company names against a few hundred structured names, but I need to compare and match around 4000 unstructured names against a table of 700k structured names. As you might have guessed, this takes forever. To give you an idea, I tried matching 6 unstructured names against the 700k, which took three hours. A quick calculation tells me that at this average speed the script will take almost 3 months to process the 4000 companies, which is a bit overwhelming. I understand that it has to do several billion calculations and that this cannot be done in a few minutes. If I could get it down to just a few days, however, I would be more than happy, and I feel like that must be possible.
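For reference, here is a minimal sketch of the kind of distances the method returns (the names below are invented for illustration, not taken from my data):

# Toy illustration only -- invented names, not my real data
library(stringdist)

stringdist("google", "google inc.", method = "jw")  # fairly small distance, likely the right match
stringdist("google", "maersk a/s",  method = "jw")  # much larger distance, clearly not a match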
So, I'm looking for ways to speed up this piece of code. I've already managed to improve it somewhat by first pairing up exact matches with the match() function, which leaves around 500 companies for further processing with the fuzzy matching algorithm. Still, that takes a long time, to say the least.
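To show what I mean by that prefilter, here is a toy sketch (made-up names and CVR numbers, just to illustrate how match() behaves):

# Toy data -- not my real tables, and the CVR numbers are invented
unstructured <- data.frame(Company = c("Google Inc.", "Google"),
                           stringsAsFactors = FALSE)
structured <- data.frame(Company = c("Google Inc.", "Maersk A/S"),
                         CVR = c(11111111, 22222222),
                         stringsAsFactors = FALSE)

# match() returns the row index of the first exact hit, or NA when there is none
unstructured$CVR <- structured$CVR[match(unstructured$Company, structured$Company)]
# "Google Inc." gets its CVR directly; "Google" stays NA and is left for the fuzzy step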
I hope I managed to explain myself clearly! Any suggestions will be highly appreciated.
library(stringdist)
# Reading the two files to be compared and making sure that I'm dealing with characters
companies.unstructured <- read.csv(
  "https://www.dropbox.com/s/opbk0s2q14l5c71/unstructured_companies.csv?dl=0",
  sep = ";",
  stringsAsFactors = FALSE
)
companies.structured <- read.csv(
  "https://www.dropbox.com/s/kyi0rvz77frr7sd/structured_companies_w_CVR.csv?dl=0",
  sep = ";",
  stringsAsFactors = FALSE
)

# Using match() to pair up all 100% identical company names, to avoid unnecessary workload for the Jaro-Winkler loop
companies.unstructured$CVR <- companies.structured$CVR[match(companies.unstructured$Company,
                                                             companies.structured$Company)]
companies.exact.match <- companies.unstructured[!is.na(companies.unstructured$CVR), ]

# Creating a subset of the remaining (unmatched) names to work on with the Jaro-Winkler loop
companies.unstructured.na <- subset(companies.unstructured, is.na(CVR))
# And here's the loop measuring the distance between the company names using the Jaro-Winkler algorithm
distance.methods <- c('jw')
dist.methods <- list()
for (m in 1:length(distance.methods)) {
  dist.name.enh <- matrix(NA, ncol = length(companies.structured$Company),
                          nrow = length(companies.unstructured.na$Company))
  for (i in 1:length(companies.structured$Company)) {
    for (j in 1:length(companies.unstructured.na$Company)) {
      dist.name.enh[j, i] <- stringdist(tolower(companies.structured[i, ]$Company),
                                        tolower(companies.unstructured.na[j, ]$Company),
                                        method = distance.methods[m])
    }
  }
  dist.methods[[distance.methods[m]]] <- dist.name.enh
}
# Matching up pairs of minimum distance
match.s1.s2.enh <- NULL
for (m in 1:length(dist.methods)) {
  dist.matrix <- as.matrix(dist.methods[[distance.methods[m]]])
  min.name.enh <- apply(dist.matrix, 1, base::min)
  for (i in 1:nrow(dist.matrix)) {
    s2.i <- match(min.name.enh[i], dist.matrix[i, ])
    s1.i <- i
    match.s1.s2.enh <- rbind(data.frame(s2.i = s2.i,
                                        s1.i = s1.i,
                                        s1Company = companies.unstructured.na[s1.i, ]$Company,
                                        s2Company = companies.structured[s2.i, ]$Company,
                                        CVR = companies.structured[s2.i, ]$CVR,
                                        adist = min.name.enh[i],
                                        method = distance.methods[m]),
                             match.s1.s2.enh)
  }
}
EDIT: Here are some example data to work with: structured_companies_w_CVR and unstructured_companies