I've been reading some related questions about grouping similar words (What is a good strategy to group similar words? and Fuzzy Group By, Grouping Similar Words). I'd like to know (1) how one of the algorithms I found in the second link works, and (2) how its programming style compares to my own 'naive' approach.
If you can answer even just one of the two, I'll give you an upvote.
(1) Can someone help step me through what's going on here?

    class Seeder:
        def __init__(self):
            self.seeds = set()
            self.cache = dict()

        def get_seed(self, word):
            LIMIT = 2
            seed = self.cache.get(word,None)
            if seed is not None:
                return seed
            for seed in self.seeds:
                if self.distance(seed, word) <= LIMIT:
                    self.cache[word] = seed
                    return seed
            self.seeds.add(word)
            self.cache[word] = word
            return word

        def distance(self, s1, s2):
            l1 = len(s1)
            l2 = len(s2)
            matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
            for zz in xrange(0,l2):
                for sz in xrange(0,l1):
                    if s1[sz] == s2[zz]:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                    else:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
            return matrix[l2][l1]

    import itertools

    def group_similar(words):
        seeder = Seeder()
        words = sorted(words, key=seeder.get_seed)
        groups = itertools.groupby(words, key=seeder.get_seed)

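The snippet above is Python 2 (`xrange`, and `range` returning a list), and as quoted `group_similar` never returns anything. For anyone who wants to run it, here is a rough Python 3 sketch of how I understand it. The `return` line in `group_similar` and the merged `cost` variable are my own assumptions, not part of the quoted code; `distance` appears to be a standard edit (Levenshtein) distance.

```python
import itertools


class Seeder:
    """Python 3 sketch of the quoted class (my reading of it)."""

    LIMIT = 2  # a word joins an existing seed if its edit distance is <= LIMIT

    def __init__(self):
        self.seeds = set()   # one representative word per group
        self.cache = {}      # word -> seed already chosen for it

    def get_seed(self, word):
        seed = self.cache.get(word)
        if seed is not None:
            return seed
        # Reuse the first existing seed that is close enough to this word.
        for seed in self.seeds:
            if self.distance(seed, word) <= self.LIMIT:
                self.cache[word] = seed
                return seed
        # No seed is close enough: the word becomes a new seed.
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        l1, l2 = len(s1), len(s2)
        # Row 0 and column 0 hold the cost of building a prefix from nothing;
        # every other cell is overwritten in the loops below.
        matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
        for zz in range(l2):
            for sz in range(l1):
                cost = 0 if s1[sz] == s2[zz] else 1
                matrix[zz + 1][sz + 1] = min(
                    matrix[zz + 1][sz] + 1,   # insertion
                    matrix[zz][sz + 1] + 1,   # deletion
                    matrix[zz][sz] + cost,    # match or substitution
                )
        return matrix[l2][l1]


def group_similar(words):
    seeder = Seeder()
    # groupby only merges adjacent items, so sort by seed first.
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    # Assumption: return each group as a list (the quoted code stops above).
    return [list(members) for _, members in groups]


residency_list = [
    'Psychiatry', 'Radiology Medicine-Prelim', 'Radiology Medicine-Prelim',
    'Medicine', 'Medicine', 'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
    'Orthopaedic Surgery', 'Surgery', 'Pediatrics', 'Medicine/Pediatrics',
]
groups = group_similar(residency_list)
for group in groups:
    print(group)
```

If I read it correctly, the first word seen in each cluster becomes the seed, so which words end up together can depend on input order when a word is within the limit of more than one seed.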
(2) In my approach, I use a defaultdict to group a list of strings called residencyList:

    Array(['Psychiatry', 'Radiology Medicine-Prelim',
           'Radiology Medicine-Prelim', 'Medicine', 'Medicine',
           'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
           'Orthopaedic Surgery', 'Surgery', 'Pediatrics',
           'Medicine/Pediatrics',])

Here is my attempt at grouping. It is based on uniqueResList, which is np.unique(residencyList):

    import collections

    d = collections.defaultdict(int)
    for i in residencyList:
        for x in uniqueResList:
            if x == i:
                if not d[x]:
                    #print i, x
                    d[x] = i
                    #print d
                if d[x]:
                    d[x] = d.get(x, ()) + ', ' + i
            else:
                #print 'no match'
                continue
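
For comparison, here is a minimal sketch of what I think the loop above is aiming for, assuming the goal is simply to collect identical strings under one key. The function name `group_exact` is hypothetical, and this does nothing for near-duplicates such as 'Obstetrics/Gyncology'.

```python
from collections import defaultdict


def group_exact(items):
    """Collect identical strings under one key (exact matches only)."""
    groups = defaultdict(list)   # missing keys start as an empty list
    for item in items:
        groups[item].append(item)
    return dict(groups)


sample = ['Medicine', 'Surgery', 'Medicine', 'Obstetrics/Gyncology']
exact_groups = group_exact(sample)
print(exact_groups)
```

Using `defaultdict(list)` avoids both the inner loop over the unique values and the string concatenation, since each string is its own key.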