
I'm new here in more than one sense: this is my first post, about my first script, in my first attempt at learning any programming language. In light of that you might find this project overly ambitious, but hey, learning by doing has always been the way to go. I'm doing my best to meet Stack Overflow etiquette here, but let me know if I'm in violation of anything.

I wanted to write a piece of code that can apply some kind of fuzzy logic in matching a table of unstructured company names (e.g. Google) with a table of structured company names (e.g. Google Inc.) and Danish company identifiers (CVR).

I was able to find some bits of code by googling around, and I managed to adapt them to work with my project. I found that the Jaro-Winkler algorithm contained within the stringdist package works particularly well with company names. The script works perfectly fine when comparing and matching 40 unstructured company names against a few hundred structured names, but I need to compare and match around 4,000 unstructured names against a table containing 700k structured names. As you might have guessed, this takes forever. To give you an idea, I tried matching 6 unstructured names against the 700k, which took three hours. A quick calculation tells me that if this is the average speed of the script, it will take me almost 3 months to process 4,000 companies, which is a bit overwhelming. I understand that it has to do several billion calculations and that this cannot be done in a few minutes. If I could get it down to just a few days, though, I would be more than happy, and I feel like this must be possible.
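To illustrate the kind of comparison involved, here is a minimal sketch of Jaro-Winkler distances with stringdist (the names are made-up examples, not from my data):

```r
library(stringdist)

# distance of 0 means an exact match
stringdist("google", "google", method = "jw")

# a close variant gives a small distance, an unrelated name a large one
stringdist("google", "google inc.", method = "jw")
stringdist("google", "maersk a/s", method = "jw")
```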

So, I'm looking for methods to speed up this piece of code. I've already managed to improve it some by initially pairing up exact matches with the match() function, which leaves around 500 companies for further processing with the fuzzy matching algorithm. Still, that takes a long time to say the least.

I hope I managed to explain myself clearly! Any suggestions will be highly appreciated.

library(stringdist)

#Reading the two files to be compared and making sure that I'm dealing with characters
companies.unstructured <- read.csv(
  "https://www.dropbox.com/s/opbk0s2q14l5c71/unstructured_companies.csv?dl=0", 
  sep = ";", 
  stringsAsFactors = FALSE
)
companies.structured <- read.csv(
  "https://www.dropbox.com/s/kyi0rvz77frr7sd/structured_companies_w_CVR.csv?dl=0", 
  sep=";", 
  stringsAsFactors = FALSE
)

#Using the match function to match up all 100% identical companies to avoid unnecessary workload for the Jaro-Winkler loop
companies.unstructured$CVR = companies.structured$CVR[match(companies.unstructured$Company, 
                                                            companies.structured$Company)]
companies.exact.match <- companies.unstructured[!is.na(companies.unstructured$CVR), ]

#Creating a subset to work on with the Jaro-Winkler loop.
companies.unstructured.na <- subset(companies.unstructured, is.na(CVR))

#And here's the loop measuring the distance between the company names using the Jaro-Winkler algorithm.
distance.methods<- c('jw')
dist.methods<-list()
for(m in 1:length(distance.methods))
{
  dist.name.enh<-matrix(NA, ncol = length(companies.structured$Company),
                        nrow = length(companies.unstructured.na$Company))
  for(i in 1:length(companies.structured$Company)) {
    for(j in 1:length(companies.unstructured.na$Company)) { 
      dist.name.enh[j,i]<-stringdist(tolower(companies.structured[i,]$Company),
                                     tolower(companies.unstructured.na[j,]$Company),
                                     method = distance.methods[m])      
    }  
  }
  dist.methods[[distance.methods[m]]]<-dist.name.enh
}

#matching up pairs of minimum distance
match.s1.s2.enh<-NULL
for(m in 1:length(dist.methods))
{

  dist.matrix<-as.matrix(dist.methods[[distance.methods[m]]])
  min.name.enh<-apply(dist.matrix, 1, base::min)
  for(i in 1:nrow(dist.matrix))
  {
    s2.i<-match(min.name.enh[i],dist.matrix[i,])
    s1.i<-i
    match.s1.s2.enh<-rbind(data.frame(s2.i=s2.i,
                                      s1.i=s1.i,
                                      s1Company=companies.unstructured.na[s1.i,]$Company,
                                      s2Company=companies.structured[s2.i,]$Company,
                                      CVR=companies.structured[s2.i,]$CVR,
                                      adist=min.name.enh[i],
                                      method=distance.methods[m]),
                           match.s1.s2.enh)
  }
}

EDIT: Here's some data examples to work with: structured_companies_w_CVR and unstructured_companies

Morten Nielsen
  • Please provide a reproducible example. Use stringsAsFactors = FALSE when reading the data. Pass vectors to the functions instead of looping. – Thierry Oct 05 '15 at 08:56
  • Thanks. By reproducible, you mean you want me to include the two csv files? – Morten Nielsen Oct 05 '15 at 09:01
  • At least a small sample of similar data. Provide everything we need to run your code. – Thierry Oct 05 '15 at 09:05
  • Got you. I edited the original post so that it now includes two links. – Morten Nielsen Oct 05 '15 at 09:14
  • I tried the links in your code. They fail to work because you saved the rownames to the csv files. Either remove the rownames column or give it a name. – Thierry Oct 05 '15 at 11:32
  • I miss companies.unstructured.na$adresse.Company in your data example. The loop will run from 1 to 0 where this is used. – phiver Oct 05 '15 at 13:18
  • @phiver: You're right, this is an error. Before posting I translated my original script with Danish variable names to English to make things more clear, but I missed that one. It should be companies.unstructured.na$Company, which is now edited into the code above. – Morten Nielsen Oct 05 '15 at 14:15
  • @Thierry: Thanks - I've deleted the column with row names now. – Morten Nielsen Oct 05 '15 at 14:15
  • Reading the csv files still fails on my machine. Do they work on your machine? Another option is that you save the dataframes `companies.unstructured` and `companies.structured` to a file and put that online: `save(companies.unstructured, companies.structured, file = "company.rda")` – Thierry Oct 06 '15 at 07:46
  • Strange. I can't read the files from the Dropbox either, but I can read the exact same files from my hd with no problems. Does it work for you if you download the files from the Dropbox and point read.csv() to your hd instead? – Morten Nielsen Oct 06 '15 at 09:57

1 Answer


I profiled your code and found some speedups. I kept to your naming conventions as much as possible so you can match up the differences. I saved the files in my working directory for testing purposes.

  1. Create an empty dataframe with the columns and the number of records you need, and update a record in the loop instead of using rbind. This speeds up the code quite a bit; I kept getting a system.time of 0. Because R doesn't know the final size of the dataframe, it makes a fresh copy on every rbind, which slows the process down considerably when you have a lot of rows. See also this post. Updating records in place is a lot faster even if the dataframe is bigger than you need.

    edit: I managed to remove everything except the match function from the loop; the rest of the dataframe can be filled with vectors / input already available from other parts of the data.

  2. I added a parallel option to the code and used stringdistmatrix. This function runs in parallel if possible, and you do not need any loop for the distance calculation.
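As a minimal sketch of point 1 (toy sizes and hypothetical columns, not the real data), compare growing a dataframe with rbind against filling a preallocated one:

```r
n <- 1000

# slow: grow with rbind inside the loop; R copies the whole dataframe each time
grown <- NULL
for (i in 1:n) {
  grown <- rbind(grown, data.frame(id = i, dist = i / n))
}

# fast: preallocate once with NA values, then assign by index
prealloc <- data.frame(id = rep(NA_integer_, n), dist = rep(NA_real_, n))
for (i in 1:n) {
  prealloc$id[i]   <- i
  prealloc$dist[i] <- i / n
}

# both approaches produce the same result, at very different cost
identical(grown$id, prealloc$id)
```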

code section:

library(stringdist)
library(parallel)


#Reading the two files to be compared and making sure that I'm dealing with characters
companies.unstructured <- read.csv("unstructured_companies.csv", 
                                   sep = ";", 
                                   stringsAsFactors = FALSE)
companies.structured <- read.csv("structured_companies_w_CVR.csv", 
                                 sep=";",
                                 stringsAsFactors = FALSE)

#Using the match function to match up all 100% identical companies to avoid unnecessary workload for the Jaro-Winkler loop
companies.unstructured$CVR <- companies.structured$CVR[match(companies.unstructured$Company, 
                                                             companies.structured$Company)]
companies.exact.match <- companies.unstructured[!is.na(companies.unstructured$CVR), ]

#Creating a subset to work on with the Jaro-Winkler loop.
companies.unstructured.na <- subset(companies.unstructured, is.na(CVR))

distance.method <- "jw"

# Parallel section starts here
# set number of cores to use. 
cores = 3
# initialize cluster
cl = makeCluster(cores, type = "PSOCK")


# create distance matrix, shortest column will be recycled. 
# See stringdistmatrix documentation
dist.name.enh <- stringdistmatrix(tolower(companies.structured$Company),
                                  tolower(companies.unstructured.na$Company),
                                  method = distance.method,
                                  nthread = getOption("sd_num_thread"))

# get the minimum Jaro-Winkler distances from the matrix
min.name.enh <- parApply(cl, dist.name.enh, 2, base::min)

# stop the cluster
stopCluster(cl)
# Parallel section ends here

# create dataframe prefilled with empty values.
match.s1.s2.enh2 <- data.frame(s2.i = rep(NA, nrow(companies.unstructured.na)),
                               s1.i = rep(NA, nrow(companies.unstructured.na)),
                               s1Company = rep(NA, nrow(companies.unstructured.na)),
                               s2Company = rep(NA, nrow(companies.unstructured.na)),
                               CVR = rep(NA, nrow(companies.unstructured.na)),
                               adist = rep(NA, nrow(companies.unstructured.na)),
                               method = rep(NA, nrow(companies.unstructured.na)))

# fill s2.i with NA values for the length needed in the for loop
s2.i <- rep(NA, ncol(dist.name.enh))

# matching up pairs of minimum distance.
for(i in 1:ncol(dist.name.enh)) {
  s2.i[i]<-match(min.name.enh[i],dist.name.enh[,i])
}

match.s1.s2.enh2$s2.i <- s2.i
match.s1.s2.enh2$s1.i <- 1:ncol(dist.name.enh)
match.s1.s2.enh2$s1Company <- companies.unstructured.na$Company
match.s1.s2.enh2$adist <- min.name.enh
match.s1.s2.enh2$method <- distance.method
match.s1.s2.enh2$s2Company <- companies.structured$Company[s2.i] 
match.s1.s2.enh2$CVR <- companies.structured$CVR[s2.i] 
phiver
  • Thanks a lot @phiver. I just ran a test on this, trying to match 6 companies up against 700k (as in my initial example). This time around it took the script about 1½ hr, so there's definitely a big improvement there. Still, as far as I can tell, it will take the script around 5 days to process 500 companies, which is closer to what I want to be able to do. Thierry initially mentioned trying to pass vectors to the function instead of the loop. Do you reckon that would be an improvement as well? Not quite sure how to go about it just yet, but I could try. – Morten Nielsen Oct 07 '15 at 15:22
  • Thanks! I feel like it's almost there, but I get an error in the '#matching up pairs of minimum distance' part: Error in s2.i[i] <- match(min.name.enh[i], dist.name.enh[, i]) : object 's2.i' not found. I can't seem to figure out what causes this. Any idea? – Morten Nielsen Oct 12 '15 at 07:44
  • I added the offending line; you have to specify s2.i before the loop. – phiver Oct 12 '15 at 08:06
  • Amazing. This time around it took me 7 minutes. Very helpful! – Morten Nielsen Oct 12 '15 at 09:19