I have a rather large string dataset (lexicon of about 400'000 entries) currently loaded into a pd.DataFrame. From this I would like to compute a proximity map. In short, that's finding for each item the set of "nth-neighbors", where two n-neighbors are defined as having exactly int(n) chars differences, without any positional permutations ('their' and 'there' are 2-neighbors according to this definition). I know how to brute-force this, however I am not very used to deal with strings thus it would be great to learn a thing or two in the process of getting it done.
Taking advantage of the fact that per def neighbors must have same length, here is my current implementation :
df = pd.DataFrame(lexicon) # a dataset of words + stats
# lexicon has a phonetic representation in df['phon'] and corresponding length in df['lenphons']
def nthNeighborsMap(lexDf:pd.DataFrame, nth:int) -> dict:
''' creates nth-neighbor index mapping from lexicon dataset'''
vpd = dict() # indexed by rowId
for ix,itemX in lexDf['phon'].items():
vpd[ix] = []
lexSl = lexDf[lexDf['lenphons'] == lexDf['lenphons'][ix] ] # must be same length
for iy,itemY in lexSl['phon'].items():
if _isNeighbor(list(itemX), list(itemY), nth) is True:
vpd[ix].append( iy )
return vpd
def _isNeighbor(wrdX:list, wrdY:list, n:int) -> bool:
''' level n neighbor : n char diffs, without pos permutations'''
assert len(wrdX) == len(wrdY) # neighbors must be same len (char replacement only)
if wrdX == wrdY:
return False # that's a given
mismatchCount = 0
for i,y in enumerate(wrdY):
if y != wrdX[i]:
mismatchCount += 1
if mismatchCount > n:
return False
if mismatchCount == n:
return True
return False
lvl3_nMap = nthNeighborsMap(df , 3)
Without seeking ultimate performance, I am pretty confident there is room for improvement.
Here are a few ideas I was thinking about:
- Learn to use regex or perl for that matter ?
- Use a better algorithm (I don't have any formal CS training , thus a bit tricky to evaluate perfomance vs complexity )?
- Find/use a dedicated library ?
- Work on branch pruning ?
This question is mainly about getting some "good practice" advice's. Basically, how would a Pro tackle this situation ?
edit: sorry I made a quick last minute untested change before posting my original message, appears the code was plain wrong (definition was not properly met in method). Should be ok now ...