improve my code to group same words in a large list python and comparison to other code

Question

I've been reading some of the other links ( What is a good strategy to group similar words? and Fuzzy Group By, Grouping Similar Words) that are related to group similar words. I'm curious (1) if someone can give me guidance as to how one of the algorithms I found in the second link works and (2) how the style of the programming compares to my own 'naive' approach?

If you can even just answer either 1 or 2, I'll give you an upvote.

(1) Can someone help step me through what's going on here?

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()
    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word,None)
        if seed is not None:
            return seed
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

import itertools

def group_similar(words):
    seeder = Seeder()
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)

(2) In my approach I have a list of strings I want to group called residencyList and used default dictionaries.

Array(['Psychiatry', 'Radiology Medicine-Prelim',
       'Radiology Medicine-Prelim', 'Medicine', 'Medicine',
       'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
       'Orthopaedic Surgery', 'Surgery', 'Pediatrics',
       'Medicine/Pediatrics',])

My effort to group. I base it off uniqueResList, which is np.unique(residencyList)

d = collections.defaultdict(int)
for i in residencyList:
    for x in uniqueResList:
        if x ==  i:
            if not d[x]:
                #print i, x
                d[x] = i  
                #print d
            if d[x]:
                d[x] = d.get(x, ()) + ', ' + i
        else:
            #print 'no match'
            continue

What kind of comparison are you going for? Length similarity? and have you looked into NLTK? — Back2Basics, Mar 07 '14 at 18:51

user3330510 · Answer 1 · 2014-03-07T21:04:01.573

A short explanation of the "ninja maths" in distance:

 # this is just the edit distance (Levenshtein) between the two words
    def distance(self, s1, s2):
        l1 = len(s1) # length of first word
        l2 = len(s2) # length of second word
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)] 
           # make an l2 + 1 by l1 + 1 matrix where the first row and column count up from
           # 0 to l1 and l2 respectively (these will be the costs of
           # deleting the letters that came before that element in each word)
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]: # if the two letters are the same then we
                       # don't have to change them so take the 
                       # cheapest path from the options of
                       # matrix[zz+1][sz] + 1 (delete the letter in s1)
                       # matrix[zz][sz+1] + 1 (delete the letter in s2)
                       # matrix[zz][sz] (leave both letters)
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else: # if the two letters are not the same then we
                         # have to change them so take the 
                         # cheapest path from the options of
                         # matrix[zz+1][sz] + 1 (delete the letter in s1)
                         # matrix[zz][sz+1] + 1 (delete the letter in s2)
                         # matrix[zz][sz] + 1 (swap a letter)
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1] # the value at the bottom of the matrix is equal to the cheapest set of edits

Thanks so much! this is useful. I didn't realize this was levenshtein. can you help me understand why a matrix is being used here? — user3314418, Mar 08 '14 at 14:56

RedBaron · Accepted Answer · 2014-03-07T18:56:11.773

I'll try to answer the first part. The class Seeder attempts to find seeds of the words. Two similar words are assumed to have the same seed and the similarity is controlled by the parameter LIMIT (In this case it is 2) which measures a distance between two words. There are many ways to compute String distance and your class does so in the distance function using some sort of ninja maths that is frankly above me.

def __init__(self):
    self.seeds = set()
    self.cache = dict()

Initialize the seeds as a set that keeps track of unique seeds till now and a cache that speeds up lookups in case we already have already seen the word (To save computation time).

For any word the get_seed function returns its seed.

def get_seed(self, word):
    #Set the acceptable distance
    LIMIT = 2
    #Have we seen this word before? 
    seed = self.cache.get(word,None)
    if seed is not None:
        #YES. Return from the cache
        return seed
    for seed in self.seeds:
        #NO. For each pre-existing seed, find the distance of this word from that seed
        if self.distance(seed, word) <= LIMIT:
            #This word is similar to the seed
            self.cache[word] = seed
            #We found this word's seed, cache it and return
            return seed
    #No we couldn't find a matching word in seeds. This is a new seed
    self.seeds.add(word)
    #Cache this word for future
    self.cache[word] = word
    #And return the seed (=word)
    return word

Then you sort the list of words in question by their seed. This ensures that words that have the same seed occur next to each other. This is important for the group by you use to form groups of words based on seed.

The distance function looks complicated and could possibly be replaced by something like Levenshtein.

Thanks so much! this is really useful. is this correct: the unique words serve as the initial seeds within a dictionary. Then we iterate through all the words and calculate their distance from the seed (e.g., unique word within the cache).. If the iterated word is below the limit, it becomes part of that seed. If there is no matching word, it becomes part of a new seed? — user3314418, Mar 08 '14 at 15:02
@user3314418 Almost, except that initially there are no seeds. (No unique words and all that). You iterate over the words and add them to seeds as you go. This is done in `words = sorted(words, key=seeder.get_seed)` — RedBaron, Mar 08 '14 at 15:13
thanks! one last q': what is this general style of programming called (with init's, functions, etc). Is this 'functional programming'? — user3314418, Mar 09 '14 at 13:55
@user3314418 Since you use classes and objects I'd say it is Object oriented Programming. It is definitely not functional programming although a style in which you follow fixed steps (and call functions) is called "procedural programming" — RedBaron, Mar 10 '14 at 04:29

improve my code to group same words in a large list python and comparison to other code

2 Answers2