Successively agrep names in a variable, then create a new variable with the shortest name for close matches

Question

Assume a character vector of company names where the names come in various forms. Here is a small version of a 10,000-row data frame; it shows the desired second column ("two.name").

structure(list(firm = structure(1:8, .Label = c("Carlson Caspers", 
"Carlson Caspers Lindquist & Schuman P.A", "Carlson Caspers Vandenburgh  Lindquist & Schuman P.A.", 
"Carlson Caspers Vandenburgh & Lindquist", "Carmody Torrance", 
"Carmody Torrance et al", "Carmody Torrance Sandak", "Carmody Torrance Sandak & Hennessey LLP"
), class = "factor"), two.name = structure(c(1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L), .Label = c("Carlson Caspers", "Carmody Torrance"
), class = "factor")), .Names = c("firm", "two.name"), row.names = c(NA, 
-8L), class = "data.frame")


                                               firm         two.name
1                                       Carlson Caspers  Carlson Caspers
2               Carlson Caspers Lindquist & Schuman P.A  Carlson Caspers
3 Carlson Caspers Vandenburgh  Lindquist & Schuman P.A.  Carlson Caspers
4               Carlson Caspers Vandenburgh & Lindquist  Carlson Caspers
5                                      Carmody Torrance Carmody Torrance
6                                Carmody Torrance et al Carmody Torrance
7                               Carmody Torrance Sandak Carmody Torrance
8               Carmody Torrance Sandak & Hennessey LLP Carmody Torrance

Assume the vector has been sorted alphabetically by firm name (which I believe puts the shortest version first). How can I use agrep() to start with the first company name, match it to the second and, assuming a close match, add the first company name to the new column (two.name) for both of them? Then match it to the third element, and so on, so that all the Carlson variations are matched.

If there is not a sufficient match, as when R encounters the first Carmody, start over with it and match to the next element, and so on until the next non-match.

If there is no match between consecutive companies, R should proceed until it finds a match.

The answer to the question "Create a unique ID by fuzzy matching of names (via agrep using R)" uses fuzzy matching on the entire vector and groups by years. It seems, however, to offer part of the code that would solve my problem. Another question ("stringdist") uses stringdist().

EDIT:

Below, the object matches is a list that shows matches, but I don't know the code to tell R to "take the first one and convert the following matches, if any, to that name and put that name in the new variable column."

# as.factor() returns a new object, so the result has to be assigned back
df$firm <- as.factor(df$firm)
matches <- lapply(levels(df$firm), agrep, x=levels(df$firm), fixed=TRUE, value=FALSE)

It seems like you are looking for a complete solution. Have you tried some approaches yourself that you have found not working? — LauriK, Jan 21 '15 at 15:12
@LauriK: I tried to use Reduce to successively agrep, but I failed. I don't know how to move "down" a vector. In short, how do I even start? — lawyeR, Jan 21 '15 at 17:24
Write a for-loop to go through the vector first, make that solution work on a small dataset. Then if you need to vectorize it or make it faster, start working towards a more complex solution, but at least you have a correct implementation to compare to first. — LauriK, Jan 21 '15 at 21:12
@LauriK: I edited my question with a starting point, perhaps at the same time you wrote the above comment. I did start with my mini-data frame, but you can see the point where I don't know what else to do. — lawyeR, Jan 21 '15 at 21:51

score 0 · Accepted Answer · answered Jan 22 '15 at 09:59

I wrote it out as a for-loop: first define the first line as short.name, then find the matches, update the data frame and pick the next name to look for. That's what I meant by "do not try to solve this with a one-liner": you have to make it work first in a much more verbose way, so you can understand what's going on. Then, and ONLY if you NEED to, you can try to compress it into a one-liner.

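firm.txt <- as.character(df$firm)
short.name <- firm.txt[1]
for (i in 2:length(firm.txt)) {
  # i don't know how to write it any prettier
  # agrep() searches the whole vector, not only the names after short.name
  match <- agrep(short.name, firm.txt)
  if (length(match) > 0) {
    df$two.name[match] <- short.name
    # pick the name after the last match as the next one to look for;
    # note that reassigning i does not change what the for loop iterates over
    i <- max(match) + 1
    short.name <- firm.txt[i]
  }
}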

LauriK

I accepted this, but later tried it with a bit longer set of names and it failed. You went a long way, LauriK, and certainly achieved much more than I could, but your code seems to fill in all agrep matches of the first name -- firm.txt[1] -- but does not go down through the vector handling each name in turn. Thus 10 rows are left out: Error in `$<-.data.frame`(`*tmp*`, "two.name", value = c("Carlson Caspers", : replacement has 7 rows, data has 17 – lawyeR Jan 23 '15 at 02:56
Yeah, I didn't think it scales very well or holds up with different data very well, but it's a start. You can go through this script line by line and see what values different variables have and then figure out how you want to change it. – LauriK Jan 23 '15 at 12:26
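
Following up on the comments above, here is a minimal sketch of a loop that does walk down the vector name by name. It is not the accepted answer's code: the function name group_successive and the max.distance default of 0.1 are illustrative choices of mine, and it assumes the vector is already sorted so that the shortest variant of each firm comes first. Each run starts with an "anchor" name, following names are compared to that anchor with agrepl(), and the first non-match becomes the next anchor. The full-length result vector is built first and assigned in one step, which avoids a length mismatch when the column is created.

```r
# Sketch only: group_successive() is a hypothetical helper, not part of base R.
# Assumes `firm` is sorted so the shortest variant of each name comes first.
group_successive <- function(firm, max.distance = 0.1) {
  firm <- as.character(firm)
  short <- character(length(firm))
  i <- 1
  while (i <= length(firm)) {
    anchor <- firm[i]          # name that starts the current run
    short[i] <- anchor
    j <- i + 1
    # extend the run while the next name approximately contains the anchor
    while (j <= length(firm) &&
           agrepl(anchor, firm[j], max.distance = max.distance, fixed = TRUE)) {
      short[j] <- anchor
      j <- j + 1
    }
    i <- j                     # the first non-match becomes the next anchor
  }
  short
}

# The eight names from the question
firm <- c("Carlson Caspers",
          "Carlson Caspers Lindquist & Schuman P.A",
          "Carlson Caspers Vandenburgh  Lindquist & Schuman P.A.",
          "Carlson Caspers Vandenburgh & Lindquist",
          "Carmody Torrance",
          "Carmody Torrance et al",
          "Carmody Torrance Sandak",
          "Carmody Torrance Sandak & Hennessey LLP")
df <- data.frame(firm = firm, stringsAsFactors = FALSE)
df$two.name <- group_successive(df$firm)
```

Because agrepl() does approximate substring matching, a very short anchor can also match unrelated longer names, so max.distance probably needs tuning against the real 10,000 rows.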
