9

Given are two lists containing strings.

  1. One contains the name of organisations (mostly universitys) all around the world - not only written in english but always using latin alphabet.

  2. The other list contains mostly full addresses in which strings (organisations) from the first list may occur.

An Example:

addresses = [
             "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
             "Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
             "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
             "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",    
             "Computer Science Department, University of California, Santa Barbara, USA 93106",
             "Fraunhofer IAIS, Sankt Augustin, Germany",
             "Department of Computer Science, Cornell University, Ithaca, NY",
             "University of Wisconsin-Madison"
            ]

organisations = [
                 "Catholic University of Leuven"
                 "Fraunhofer IAIS"
                 "Cornell University of Ithaca"
                 "Tübingener Max Plank Institut"
                ]

As you can see the desired mapping would be:

"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
--> Catholic University of  Leuven
"Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
--> Max Plank Institut Tübingen
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
--> --
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
--> Fraunhofer IAIS 
"Computer Science Department, University of California, Santa Barbara, USA 93106",
"Fraunhofer IAIS, Sankt Augustin, Germany",
--> Fraunhofer IAIS
"Department of Computer Science, Cornell University, Ithaca, NY"
--> "Cornell University of Ithaca",
"University of Wisconsin-Madison",
--> --

My thinking was to use some kind of "disctance- algorithm" to calculate the similarity of the strings. Since I cannot just look for an organisation in an address just by doing if address in organisation because it could be written slightly differently at in different places. So my first guess was using the difflib module. Especially the difflib.get_close_matches() function for selecting for every address the closest string from the organisations list. But I am not quite confident, that the results will be accurate enough. Although I don't know how high I should set the ratio which seams to be a similarity measure.

Before spending too much time in trying the difflib module I thought of asking the more experienced people here, if this is the right approach or if there is a more suited tool / way to solve my problem. Thanks!

PS: I don't need an optimal solution.

Aufwind
  • 25,310
  • 38
  • 109
  • 154
  • Maybe useful to you: http://stackoverflow.com/questions/577463/finding-how-similar-two-strings-are – Marijn van Vliet Dec 08 '11 at 14:53
  • @Rodin: The [link you posted](http://stackoverflow.com/a/577551/641514) says *The Levenstein distance measures distance in terms of the number of "operations" required to transform one string to another. These operations include insertion, deletion and substition.* My organisations list has around 8000 entries, my addresses list has 230000 items. Is Levenstein still the way to go, if the organisations strings are very short (example *Fraunhofer IAIS*) and the address is very long (example: *Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754*)? – Aufwind Dec 08 '11 at 15:00
  • 2
    I think you're on the right track with your assumption. Levenshtein distance comes to mind, and this: http://en.wikipedia.org/wiki/Bitap_algorithm. – hochl Dec 08 '11 at 15:00
  • @hochl: Even regarding the *concerns* I commented above? – Aufwind Dec 08 '11 at 15:02
  • 1
    @Aufwind: You should split your strings to words before comparing them, then count the number of matched words. For example, for "Fraunhofer IAIS", search each address for similar words for both "Fraunhofer" and "IAIS". You should also normalize the case of all words (e.g., lower-case), and might want to ignore "noise words" like "of". Give scores like "exact match = 5, close match = 1" and take the address with the highest score. Maybe a good heuristic would be to give long matches higher scores. – Ferdinand Beyer Dec 08 '11 at 15:06
  • 1
    Oh, here is a very interesting reading closely related to your problem: "How to Write a Spelling Corrector" http://norvig.com/spell-correct.html – Ferdinand Beyer Dec 08 '11 at 15:10
  • +1 good suggestion, maybe split by words and only words longer than 4 characters. Maybe exclude words consisting of only uppercase. – hochl Dec 08 '11 at 15:10

2 Answers2

2

Use the following as your string distance function (instead of plain levenshtein distance):

def strdist(s1, s2):
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)

    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    n_shared_words = len([s for s in scores if s <= 3])
    return -n_shared_words 

Then use the Munkres assignment algorithm shown here since there appears to be a 1:1 mapping between organisations and adresses.

Community
  • 1
  • 1
Björn Lindqvist
  • 19,221
  • 20
  • 87
  • 122
0

You can use soundex or metaphone to translate the sentence into a list of phonems, then compare the most similar lists.

Here is a Python implementation of the double-metaphone algo.

Bite code
  • 578,959
  • 113
  • 301
  • 329