
I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.

So far I've found several approaches, but they seem quite old and not very efficient. I'm sure better methods exist, so if you have references for me to read, I'm sure this is a subject that may interest several people.

The solutions I found (examples are in R):

  • Levenshtein distance, which is the minimum number of characters you have to insert, delete or change to transform one word into another.

    agrep("acusait", c("accusait", "abusait"), max = 2, value = TRUE) ## [1] "accusait" "abusait"

  • Phonetic comparison (e.g. the Soundex algorithm)

    library(RecordLinkage)
    soundex(c('accusait', 'acusait', 'abusait'))
    ## [1] "A223" "A223" "A123"

  • The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), but I guess it is not very efficient on addresses (a rough sketch of that idea follows this list).

  • I thought about using Google Suggest's suggestions, but likewise, it is not very efficient on personal postal addresses.

  • You could imagine a supervised machine-learning approach, but you would need to have stored users' misspelled queries to train it, which is not an option for me.
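
For the spelling-corrector idea, a very rough sketch of the frequency-weighted flavour (illustration only; the candidate list and frequencies below are made up) would be to pick the nearest candidate by edit distance and break ties by how often each candidate occurs in a reference address list:

    # Rough sketch of a frequency-weighted corrector (not Norvig's actual code).
    correct <- function(word, candidates, freq) {
      d <- utils::adist(word, candidates)            # Levenshtein distances
      nearest <- which(d == min(d))                  # closest candidates
      candidates[nearest][which.max(freq[nearest])]  # most frequent among them
    }

    correct("acusait", c("accusait", "abusait"), freq = c(120, 15))
    ## [1] "accusait"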

Stéphanie C
  • Could you specify your question/problem more precisely, please? What specifically is wrong, or what beef do you have, with the (standard) approaches you listed? And what data do you have to start with/match against? – fnl Mar 23 '16 at 09:54
  • @fnl I have postal addresses, so these techniques are not really efficient. Imagine for example a French address like "62 bvd Col Prevost" that you would like to match with "62 boulevard Colonel de Prevot". It is harder than just matching two random strings. – Stéphanie C Mar 23 '16 at 14:38
  • Stéphanie, what you are describing then is an abbreviation expansion problem. There is plenty of research on that. Other than that, just split your problem into smaller ones. E.g., you can also look at that (particular) case as a string alignment problem, e.g. with the [Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). – fnl Mar 28 '16 at 09:13
  • @fnl do you have a good reference on the abbreviation expansion problem that you would recommend? Otherwise I will have a look on Google. Thanks for your input – Stéphanie C Apr 01 '16 at 10:09
  • Stéphanie, my main speciality is BioNLP, and although I now work on commercial issues, I did not have any work-related problems that made me look into AE techniques outside of the medical and genomics domains. So my guess at good AE papers for your particular domain will be as good as yours. But if you are familiar with the biomedical domain, and want some pointers there, let me know, and I can post a long list of relevant literature. – fnl Apr 02 '16 at 10:38
  • @fnl, thanks a lot for your help. I work in official statistics, which is quite far from your field! I will keep on digging with all the suggestions raised here – Stéphanie C Apr 02 '16 at 19:42

2 Answers


I look at this as a spelling-correction problem, where you need to find the nearest-matching word in some sort of dictionary. What I mean by "near" is Levenshtein distance, except that the smallest number of single-character insertions, deletions, and replacements is too restrictive. Other kinds of "spelling mistakes" are also possible, for example transposing two characters.
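
For instance, with the stringdist package (used in the other answer below), a transposed pair of letters costs two edits under plain Levenshtein but only one under the variants that allow transpositions; a quick illustration (not part of the original answer):

    library(stringdist)
    # "teh" -> "the" is a single transposition of adjacent letters
    stringdist("teh", "the", method = "lv")   # plain Levenshtein: 2
    stringdist("teh", "the", method = "osa")  # allows transpositions: 1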

I've done this a few times, but not lately. The most recent case had to do with concomitant medications for clinical trials. You would be amazed how many ways there are to misspell "acetylsalicylic".

Here is an outline in C++ of how it is done.

Briefly, the dictionary is stored as a trie, and you are presented with a possibly misspelled word, which you try to look up in the trie. As you search, you try the word as it is, and you also try all possible alterations of it at each point. As you go, you have an integer budget of how many alterations you can tolerate, which you decrement every time you apply an alteration. If you exhaust the budget, you allow no further alterations.
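
To make the budget idea concrete, here is a rough R sketch (an illustration only, not the C++ outline referred to above). It checks a single candidate word rather than walking a shared trie, so it misses the prefix-sharing, but the "try it as is, or spend budget on an alteration" logic is the same:

    # TRUE if `a` can be turned into `b` with at most `budget` single-character
    # alterations (replacement, deletion, insertion).
    within_budget <- function(a, b, budget) {
      # Characters that already agree cost nothing: consume the common prefix.
      while (nchar(a) > 0 && nchar(b) > 0 && substr(a, 1, 1) == substr(b, 1, 1)) {
        a <- substring(a, 2); b <- substring(b, 2)
      }
      if (nchar(a) == 0 && nchar(b) == 0) return(TRUE)  # exact match
      if (budget == 0) return(FALSE)                    # no alterations left
      # Otherwise spend one unit of budget on an alteration and keep going.
      (nchar(a) > 0 && nchar(b) > 0 &&
         within_budget(substring(a, 2), substring(b, 2), budget - 1)) ||  # replace
        (nchar(a) > 0 && within_budget(substring(a, 2), b, budget - 1)) ||  # delete
        (nchar(b) > 0 && within_budget(a, substring(b, 2), budget - 1))     # insert
    }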

Now there is a top-level loop, where you call the search. On the first iteration, you call it with a budget of 0. When the budget is 0, it will allow no alterations, so it is simply a direct lookup. If it fails to find the word with a budget of 0, you call it again with a budget of 1, so it will allow a single alteration. If that fails, try a budget of 2, and so on.
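
Continuing the illustration, the top-level retry loop could look like this (using the within_budget helper sketched above):

    # Try a direct lookup first (budget 0), then allow one alteration, then two...
    fuzzy_lookup <- function(word, dictionary, max_budget = 3) {
      for (budget in 0:max_budget) {
        hits <- dictionary[vapply(dictionary,
                                  function(d) within_budget(word, d, budget),
                                  logical(1))]
        if (length(hits) > 0) return(hits)  # stop at the smallest budget that works
      }
      character(0)  # nothing close enough
    }

    fuzzy_lookup("acusait", c("accusait", "abusait", "boulevard"))
    ## [1] "accusait" "abusait"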

Something I have not tried is a fractional budget. For example, suppose a typical alteration reduces the budget by 2, not 1, and the budget goes 0, 2, 4, etc. Then some alterations could be considered "cheaper". For example, a vowel replacement might only decrement the budget by 1, so for the cost of one consonant replacement you could do two vowel replacements.
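
A weighted version of the same sketch could look like the following (hypothetical: the cost table is made up, with vowel-for-vowel replacements costing 1 and everything else costing 2, so the outer loop would step the budget 0, 2, 4, ...):

    # Cost of replacing character x by character y: cheaper if both are vowels.
    alteration_cost <- function(x, y) {
      vowels <- c("a", "e", "i", "o", "u")
      if (x %in% vowels && y %in% vowels) 1 else 2
    }

    within_budget_weighted <- function(a, b, budget) {
      while (nchar(a) > 0 && nchar(b) > 0 && substr(a, 1, 1) == substr(b, 1, 1)) {
        a <- substring(a, 2); b <- substring(b, 2)
      }
      if (nchar(a) == 0 && nchar(b) == 0) return(TRUE)
      replaced <- nchar(a) > 0 && nchar(b) > 0 && {
        cost <- alteration_cost(substr(a, 1, 1), substr(b, 1, 1))
        budget >= cost &&
          within_budget_weighted(substring(a, 2), substring(b, 2), budget - cost)
      }
      replaced ||  # deletions and insertions always cost 2 in this variant
        (budget >= 2 && nchar(a) > 0 &&
           within_budget_weighted(substring(a, 2), b, budget - 2)) ||
        (budget >= 2 && nchar(b) > 0 &&
           within_budget_weighted(a, substring(b, 2), budget - 2))
    }

    within_budget_weighted("abusoit", "abusait", 1)  # TRUE: one vowel swap
    within_budget_weighted("abusoit", "abuscit", 1)  # FALSE: consonant swap costs 2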

If the word is not misspelled, the time it takes is proportional to the number of letters in the word. In general, the time it takes is exponential in the number of alterations in the word.

If you are working in R (as in the examples above), I would have it call out to the C++ program, because you need the speed of a compiled language for this.
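
One convenient way to do that (a suggestion, not something specified in this answer) is the Rcpp package, which compiles a C++ function and makes it callable from R. As a stand-in for the real trie search, here is a plain dynamic-programming Levenshtein distance:

    library(Rcpp)

    # Compile a small C++ helper and expose it to R. In practice the budgeted
    # trie search would live in a function like this.
    cppFunction('
    int lev(std::string a, std::string b) {
      int n = a.size(), m = b.size();
      std::vector<int> prev(m + 1), cur(m + 1);
      for (int j = 0; j <= m; ++j) prev[j] = j;
      for (int i = 1; i <= n; ++i) {
        cur[0] = i;
        for (int j = 1; j <= m; ++j) {
          int best = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);  // replace/match
          if (prev[j] + 1 < best) best = prev[j] + 1;               // delete
          if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1;         // insert
          cur[j] = best;
        }
        prev = cur;
      }
      return prev[m];
    }')

    lev("acusait", "accusait")  # 1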

Mike Dunlavey
  • Do you think that's what Google does for autocompletion of addresses? https://developers.google.com/places/web-service/autocomplete#place_autocomplete_responses – Stéphanie C Mar 21 '16 at 14:56
  • @StéphanieC: I don't know how Google does it. This might be part of what they do. – Mike Dunlavey Mar 21 '16 at 16:54
  • Thanks for your answer, I will check the code, but I am slightly disappointed since as a statistician I thought you would try to avoid testing all combinations (I know you have a budget limit, but still) – Stéphanie C Mar 21 '16 at 17:34
  • @StéphanieC: The budget is the key to how it works. This is a simple variation on [*branch-and-bound*](https://en.wikipedia.org/wiki/Branch_and_bound) search. When the budget is zero, it behaves like a direct trie search. When the budget is 1, it permits a single alteration, and so on. It is OK to do multiple passes, because the time is dominated by the time spent in the last pass. And there's an advantage to using tries - it is easy to handle common prefixes and suffixes by sharing sub-tries. That drastically reduces the size of the dictionary. – Mike Dunlavey Mar 21 '16 at 21:55

Extending what Mike had to say, here is how I used the string-matching library stringdist in R to match a vector of addresses that errored out in ArcGIS's geocoding function:

library(stringdist)

rows <- length(unmatched$address)
# vectors to hold our matched addresses and their similarity scores
matched_add <- rep(NA, rows)
score <- rep(NA, rows)

# for instructional purposes only; you could use sapply instead of an explicit loop

for (u in 1:rows) {
    # amatch gives the position of the closest match in the index address vector
    pos <- amatch(unmatched$address[u], index$address, maxDist = Inf)
    matched_add[u] <- index$address[pos]

    # stringsim gives you a similarity score to go back and adjust your
    # parameters with
    score[u] <- stringsim(unmatched$address[u], index$address[pos])
}

Stringdist has several methods you can use to find approximate matches, including Levenshtein (method="lv"). You'll probably want to tinker with these to fit your dataset as well as you can.
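
For a quick feel of how the methods differ, you can compare a couple of them on the toy words from the question before pointing them at full addresses (values shown are approximate):

    library(stringdist)

    stringdist("acusait", "accusait", method = "lv")  # Levenshtein distance: 1
    stringdist("acusait", "accusait", method = "jw")  # Jaro distance: about 0.04
    stringsim("acusait", "accusait", method = "lv")   # similarity in [0, 1]: 0.875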