
I have a rather large string dataset (a lexicon of about 400'000 entries) currently loaded into a pd.DataFrame. From this I would like to compute a proximity map: in short, finding for each item the set of "nth-neighbors", where two n-neighbors are defined as differing in exactly n character positions, without any positional permutations ('their' and 'there' are 2-neighbors according to this definition). I know how to brute-force this, but I am not very used to dealing with strings, so it would be great to learn a thing or two in the process of getting it done.
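
To make the definition concrete, here is a minimal sketch of the mismatch count it describes (the helper name charDiffs is just for illustration):

def charDiffs(x: str, y: str) -> int:
    # count positional character differences between two equal-length strings
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

print(charDiffs('their', 'there'))  # 2, so they are 2-neighbors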

Taking advantage of the fact that, by definition, neighbors must have the same length, here is my current implementation:



import pandas as pd

df = pd.DataFrame(lexicon)  # a dataset of words + stats
# lexicon has a phonetic representation in df['phon'] and corresponding length in df['lenphons'] 

def nthNeighborsMap(lexDf:pd.DataFrame, nth:int) -> dict:
    ''' creates nth-neighbor index mapping from lexicon dataset'''
    vpd = dict() # indexed by rowId
    for ix,itemX in lexDf['phon'].items():
        vpd[ix] = [] 
        lexSl = lexDf[lexDf['lenphons'] == lexDf['lenphons'][ix] ] # must be same length 
        for iy,itemY in lexSl['phon'].items():
            if _isNeighbor(list(itemX), list(itemY), nth):
                vpd[ix].append( iy )                            
    return vpd

def _isNeighbor(wrdX:list, wrdY:list, n:int) -> bool:
    ''' level n neighbor : n char diffs, without pos permutations'''
    assert len(wrdX) == len(wrdY) # neighbors must be same len (char replacement only)
    if wrdX == wrdY:
        return False # that's a given
    mismatchCount = 0
    for i,y in enumerate(wrdY):
        if y != wrdX[i]:
            mismatchCount += 1
        if mismatchCount > n:
            return False
    return mismatchCount == n


lvl3_nMap = nthNeighborsMap(df , 3)

Without seeking ultimate performance, I am pretty confident there is room for improvement.

Here are a few ideas I was thinking about:

  • Learn to use regex, or Perl for that matter?
  • Use a better algorithm (I don't have any formal CS training, so it's a bit tricky for me to evaluate performance vs. complexity)?
  • Find/use a dedicated library?
  • Work on branch pruning?

This question is mainly about getting some "good practice" advice. Basically, how would a pro tackle this situation?

edit: sorry, I made a quick last-minute untested change before posting my original message; it turns out the code was plain wrong (the definition was not properly met in the method). Should be OK now...

  • "Should be ok now" does this means this is fast now? By the way, Python is clearly not design for performance expect if you use vectorized operation (functions of native library like Numpy). Loops are generally painfully slow. This is because CPython is an interpreter maint to run glue code or basic scripts (not computationally intensive codes). CPython strings tends not to be fast because they are dynamic object (and also often because of unicode). I advise you to use a native language for this (and bounded ASCII strings if possible). – Jérôme Richard Mar 19 '23 at 16:25
  • @Jérôme Richard, "Should be ok now" means it is functional (I just corrected a stupid error). I know Python is not the best for this kind of thing; which language would you recommend as low-hanging fruit (i.e. something not too time-consuming for a beginner, yet powerful enough)? I was thinking about Perl, is it a good option? Thanks! – tikitakitok Mar 19 '23 at 18:05
  • AFAIK, Perl is interpreted too. Languages like C and C++ should be good for such a task, assuming the code is written efficiently (e.g. avoiding allocations, memory indirections, etc.). C is simpler for beginners, but there are certainly fewer high-level libraries to help you do this (e.g. loading the dataframe). C++ should be fine, though learning C++ might be a significant effort relative to your needs (it is not wasted time, though, if you frequently need to optimize such computations). – Jérôme Richard Mar 19 '23 at 18:25
  • In general, I tend to advise using NumPy, Cython or Numba, but here none of them seems sufficient to make this much faster at first glance. In particular, Numba is very slow with strings, Cython does not help much because of dynamic Unicode string objects, and NumPy can hardly be used to speed up the loop in `_isNeighbor` because of the early break (and NumPy is not a silver bullet for strings anyway). – Jérôme Richard Mar 19 '23 at 18:28
  • @JérômeRichard if strings are a problem, maybe convert them into int16 or something? @tikitakitok Regardless of that and the chosen programming language, I suggest a more efficient algorithm, using a tree. In [this question](https://stackoverflow.com/questions/67548039/nearest-neighbor-search-over-levenshtein-distance-in-python-using-metric-indexin) the OP is trying to solve a similar problem. – dankal444 Mar 19 '23 at 18:57
  • @dankal444 This is a good idea indeed, assuming the cost of the conversion is small compared to the computation (I guess this should be fine here). Then Numba/Cython can be used to optimize the code (without the need to learn a new language). By the way, I just saw that `lexDf[lexDf['lenphons'] == lexDf['lenphons'][ix] ]` can be replaced by a groupby (not sure this is the expensive part though). – Jérôme Richard Mar 19 '23 at 21:38
  • @dankal444, thanks for the pointers, looking into it. I have a follow-up question about the numeric conversion, if you don't mind: among the many ways the mapping could be made, is there a theoretical 'best' one (e.g. a most efficient hash)? – tikitakitok Mar 20 '23 at 11:42
  • @tikitakitok you need to [encode the strings into utf-32](https://stackoverflow.com/a/33787418/4601890). If your strings are "standard" you can risk encoding to utf-16. Next you need to convert from bytes to np.int32 using, for example, `np.frombuffer` (see the sketch below this thread). – dankal444 Mar 20 '23 at 14:54
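
Not part of the original thread, but a minimal sketch of the conversion suggested above (assuming the strings fit in the chosen encoding):

import numpy as np

# 'utf-32-le' is a fixed-width encoding: one 4-byte code point per character,
# so the resulting int32 array has exactly one element per character.
x = np.frombuffer("their".encode("utf-32-le"), dtype=np.int32)
y = np.frombuffer("there".encode("utf-32-le"), dtype=np.int32)
print((x != y).sum())  # 2 positional differences

# For plain ASCII data, one byte per character is enough:
x8 = np.frombuffer("their".encode("ascii"), dtype=np.uint8)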

1 Answer


I'll make the assumption that all the words in your dataframe can be represented using ASCII (you'll be fine if they only contain letters from the classical Latin alphabet).

From there, we can see a few areas that are inefficient in your original code:

  • As mentioned in the comments, loops in pure Python are extremely slow.
  • You're doing double the necessary work: in your main loop, you first check whether x is an N-neighbor of y, then later whether y is an N-neighbor of x. You can use the symmetry of the N-neighbor relation to only do the first check.

In the version below, I'll be encoding words as arrays of numbers, which allows me to apply efficient, vectorized operations. On top of that, I'll be using numba to compile the most costly code to much faster machine code.

The output format is exactly the same, which makes it easy to check that both programs are doing the exact same thing:

from numba import njit
import numpy as np
import pandas as pd

@njit
def _fast_find_dist(u, n, out):
    """
    :param u: an array of encoded words.
    :param n: the target number of matches.
    :param out: the array to write the output to.
    out[i,j] = True iff u[i] and u[j] are n-neighbors
    and i<j (we use the symmetry to only fill half the matrix)
    """
    for i in range(u.shape[0] - 1):
        for j in range(i + 1, u.shape[0]):
            out[i,j] = (u[i]!=u[j]).sum() == n

class WordMapper():
    def __init__(self, df):
        """
        :param df: a dataframe containing a 'phon' and
        a 'lenphons' column. 'phon' must contain ASCII strings,
        and 'lenphons' their length.
        """
        # Group all words of the same length in len2numphons[length]
        # These words are represented using a uint8-encoded numpy array
        self.len2numphons = {}
        # len2idx[wordlen] is a pandas Index containing the
        # df index of each word of length wordlen
        self.len2idx = {}
        for length, idx in df.groupby("lenphons").groups.items():
            self.len2numphons[length] = np.array([np.frombuffer(x.encode("ascii"), dtype=np.uint8) for x in df["phon"].loc[idx].values])
            self.len2idx[length] = idx

    def neighbor_map(self, n:int)->dict[int, list[int]]:
        """
        Return the mapping of n-neighbors for every word in the lexicon.
        """
        res = {}
        for length in self.len2numphons.keys():
            res |= self._find_dist(length, n)
        return res
    
    def _find_dist(self, length, n):
        # Call _fast_find_dist and
        # convert matching matrix to dict
        # (numba and dictionaries don't work too well)
        u = self.len2numphons[length]
        match_mat = np.zeros((u.shape[0], u.shape[0]), dtype=bool)
        _fast_find_dist(u, n, match_mat)
        len2idx = self.len2idx[length]
        res = {i: [] for i in len2idx}
        for i in range(match_mat.shape[0] - 1):
            for j in range(i + 1, match_mat.shape[0]):
                if match_mat[i,j]:
                    res[len2idx[i]].append(len2idx[j])
                    res[len2idx[j]].append(len2idx[i])
        return res

To benchmark the code, I use the english-words package to download a dataset of words:

from english_words import get_english_words_set

web2lowerset = get_english_words_set(['web2'], lower=True)
df = pd.DataFrame({"phon":list(web2lowerset), "lenphons":[len(x) for x in web2lowerset]})

# Run this answer's version:
wm = WordMapper(small_df)
wm.neighbor_map(3)

By sampling 40,000 random words and searching for 3-neighbors, I get the following runtimes:

  • Your version: 3min 9s (189s)
  • This answer's version: 9.9s

That makes it a 19x speedup for that dataset size.

Note: I believe this code could be further improved significantly by figuring out a way to create a dictionary directly in the numba section or by changing the output format. A large part of the remaining time is now spent converting the numba output to a dictionary to match the question's output format.
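
For illustration only (not part of the original answer), one way to shrink that conversion cost is to let NumPy extract the matching index pairs with np.argwhere, so the Python-level loop only runs over the actual matches instead of the full matrix. A sketch of such a variant, meant to be dropped into the WordMapper class above (the name _find_dist_argwhere is made up here):

    def _find_dist_argwhere(self, length, n):
        # Same as _find_dist, but np.argwhere pulls the True entries out of
        # the matching matrix in one vectorized call, so Python only iterates
        # over the (usually sparse) matches.
        u = self.len2numphons[length]
        match_mat = np.zeros((u.shape[0], u.shape[0]), dtype=bool)
        _fast_find_dist(u, n, match_mat)
        len2idx = self.len2idx[length]
        res = {i: [] for i in len2idx}
        for i, j in np.argwhere(match_mat):  # only the upper triangle is filled
            res[len2idx[i]].append(len2idx[j])
            res[len2idx[j]].append(len2idx[i])
        return res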

Seon