
Using R, I am trying to match people's names in a dataset structured by year and city. Because of spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy-match the names.

A sample chunk of the dataset is structured as follows:

df <- data.frame(matrix(
  c("1200013","1200013","1200013","1200013",
    "1200013","1200013","1200013","1200013",
    "1996","1996","1996","1996","2000","2000","2004","2004",
    "AGUSTINHO FORTUNATO FILHO","ANTONIO PEREIRA NETO",
    "FERNANDO JOSE DA COSTA","PAULO CEZAR FERREIRA DE ARAUJO",
    "PAULO CESAR FERREIRA DE ARAUJO","SEBASTIAO BOCALOM RODRIGUES",
    "JOAO DE ALMEIDA","PAULO CESAR FERREIRA DE ARAUJO"),
  ncol = 3,
  dimnames = list(seq_len(8), c("citycode", "year", "candidate"))
))

The neat version:

  citycode year                      candidate
1  1200013 1996      AGUSTINHO FORTUNATO FILHO
2  1200013 1996           ANTONIO PEREIRA NETO
3  1200013 1996         FERNANDO JOSE DA COSTA
4  1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5  1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6  1200013 2000    SEBASTIAO BOCALOM RODRIGUES
7  1200013 2004                JOAO DE ALMEIDA
8  1200013 2004 PAULO CESAR FERREIRA DE ARAUJO

I'd like to check, for each city separately, whether any candidates appear in several years. E.g., in the example,

PAULO CEZAR FERREIRA DE ARAUJO

PAULO CESAR FERREIRA DE ARAUJO

appears twice (once with a spelling mistake). Each candidate across the entire dataset should be assigned a unique numeric candidate ID. The dataset is fairly large (5,500 cities, approx. 100K entries), so reasonably efficient code would be helpful. Any suggestions on how to implement this?

EDIT: Here is my attempt (with help from the comments thus far). It achieves the task but is very slow (inefficient). Any suggestions for improving it?

f <- function(x) {
  # for each level, find the indices of all levels it fuzzy-matches
  matches <- lapply(levels(x), agrep, x = levels(x), fixed = TRUE, value = FALSE)
  # collapse each level to the first level it matched
  levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
  x
}

temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)

EDIT 2: Now running at good speed. The problem was the comparison against the factor levels of the entire dataset at every step (thanks for pointing that out, Blue Magister). Restricting the comparison to only the candidates within one group (i.e. a city) runs the command in 5 seconds for 80,000 lines - a speed I can live with.

df$candidate <- as.character(df$candidate)

f <- function(x) {
  # re-factor so the levels are only the candidates of this city
  x <- as.factor(x)
  matches <- lapply(levels(x), agrep, x = levels(x), fixed = TRUE, value = FALSE)
  # collapse each level to the first level it matched
  levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
  as.character(x)
}

temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
# note: unlist() concatenates the groups in citycode order, so this
# assumes df is already sorted by citycode
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)
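To get the unique numeric candidate ID asked for above, one option (a sketch along the lines of the loop in the comments below; candid is an illustrative column name) is to factor the combination of city and cleaned name:

# assumption: the same cleaned name in two different cities is two
# different candidates, so the ID keys on citycode + cleaned name
df$candid <- as.numeric(factor(paste(df$citycode, df$candidatenew, sep = "-")))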
thomasB
  • What have you tried so far? Is the problem with getting `agrep` to match or doing it efficiently? – Ari B. Friedman Oct 21 '12 at 16:41
  • 1
    Mostly the efficiency part. I ran a loop through all cities and candidates but that takes quite a long time. Am able to find the fuzzy matches within each city but had a hard time making a unique ID across the entire data set. – thomasB Oct 21 '12 at 16:46
  • Can you post the code for your loop? Also see http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941 . – Ari B. Friedman Oct 21 '12 at 16:49
  • My loop was `for (i in 1:dim(df)[1]) { df$match[i] = sort(agrep(df$candidate[i], df$candidate, ignore.case = FALSE, value = TRUE, max.distance = 0.1))[1] }` followed by `df$candid <- as.numeric(as.factor(paste(df$match, "-", df$citycode, sep="")))`, and then repeated (through a loop) for each city. – thomasB Oct 21 '12 at 17:20

2 Answers


Here's my shot at it. It's probably not very efficient, but I think it will get the job done. I assume that df$candidate is of class factor.

# fuzzy-match each candidate name against the other candidate names,
# comparing each pair of names only once by looking only at names
# that have a greater index
matches <- unlist(lapply(1:(length(levels(df[["candidate"]])) - 1),
    function(x) {max(x, x + agrep(
        pattern = levels(df[["candidate"]])[x],
        x = levels(df[["candidate"]])[-seq_len(x)]
    ))}
))
# assign new levels (omit the last level because that doesn't change)
levels(df[["candidate"]])[-length(levels(df[["candidate"]]))] <-
    levels(df[["candidate"]])[matches]
Blue Magister
  • First part is great. The second part does not seem to run. In any case, thank you for getting me started. Perhaps you have a suggestion as to how to improve the "solution" I added to the question. – thomasB Oct 22 '12 at 18:13
  • I've updated with something I think is a bit better. You won't have to `agrep` through as many elements. It works on the sample dataset, but I'm unsure how well it will work on your larger dataset. – Blue Magister Oct 23 '12 at 01:10
  • Thank you very much. The comparison to many levels was indeed the problem. Now I restrict the comparison to the levels within each group, and the speed is hundreds of times faster. – thomasB Oct 23 '12 at 09:13
  • If you permit, one last question regarding your previous solution: can I add options to the `agrep` command in the line `lapply(levels(x), agrep, x=levels(x), fixed=TRUE, value=FALSE)`? I would like to specify `max.distance`. – thomasB Oct 23 '12 at 09:15
  • 1
    Yes: `lapply(levels(x), agrep, x=levels(x),fixed=TRUE, value=FALSE, max.distance = .2)`. (Note you can format text to look like code with the backquote `\``.) – Blue Magister Oct 23 '12 at 12:33

OK, given that the focus is on efficiency, I'd suggest the following.

First, from first principles we can predict that exact matching will be much faster than grep, which in turn will be faster than fuzzy grep. So exact-match first, then fuzzy-grep only the remaining observations.

Second, vectorize and avoid loops. The apply commands aren't necessarily faster, so stick to native vectorization if you can. All the grep commands are natively vectorized, but it's hard to avoid a *ply or a loop when comparing each element to the vector of others to match against.

Third, use outside information to narrow the problem down. For instance, fuzzy-match names only within each city or state; this dramatically reduces the number of comparisons that must be made.

You can combine the first and third principles: you might even try exact matching on the first character of each string, then fuzzy matching within that block. A rough sketch of the exact-then-fuzzy idea is below.
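A minimal sketch of that strategy (my illustration, not code from the answer; `canonicalize` and `candid` are made-up names). Exact duplicates collapse for free via unique(), and only the distinct spellings within each city are fuzzy-grepped:

# canonicalize names within one city: exact duplicates cost nothing
# because only the unique spellings are fuzzy-grepped
canonicalize <- function(nms, max.distance = 0.1) {
  u <- unique(nms)            # exact matching: one entry per distinct spelling
  canon <- u
  for (i in seq_along(u)) {
    # fuzzy-grep only against earlier, already-canonical spellings
    hit <- agrep(u[i], canon[seq_len(i - 1)], max.distance = max.distance)
    if (length(hit) > 0) canon[i] <- canon[hit[1]]
  }
  canon[match(nms, u)]        # map every observation back to its canonical name
}

# apply per city, then build a dataset-wide numeric ID
df$candidatenew <- ave(as.character(df$candidate), df$citycode,
                       FUN = canonicalize)
df$candid <- as.numeric(factor(paste(df$citycode, df$candidatenew)))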

Ari B. Friedman