In my database, I have a DataFrame of hundreds of thousands of companies, and I need to match them against another DataFrame that contains all existing companies.
To do that, I use PySpark:
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH
    from pyspark.sql.functions import col

    def match_names(df_1, df_2):
        pipeline = Pipeline(stages=[
            # pattern="" splits each name into individual characters
            RegexTokenizer(
                pattern="", inputCol="name", outputCol="tokens", minTokenLength=1
            ),
            # build character 3-grams from the tokens
            NGram(n=3, inputCol="tokens", outputCol="ngrams"),
            # hash the n-grams into sparse feature vectors
            HashingTF(inputCol="ngrams", outputCol="vectors"),
            MinHashLSH(inputCol="vectors", outputCol="lsh")
        ])
        model = pipeline.fit(df_1)
        stored_hashed = model.transform(df_1)
        landed_hashed = model.transform(df_2)
        landed_hashed = landed_hashed.withColumnRenamed('name', 'name2')
        # join all pairs whose Jaccard distance is below the threshold (1.0 here)
        matched_df = model.stages[-1].approxSimilarityJoin(
            stored_hashed, landed_hashed, 1, "confidence"
        ).select(col("datasetA.name"), col("datasetB.name2"), col("confidence"))
        return matched_df
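I call it roughly like this (the DataFrame variable names below are just placeholders for my two tables):

    # df of my stored companies vs. df of all existing companies
    matched_df = match_names(stored_companies_df, all_companies_df)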
Then I also calculate the Levenshtein distance for each matched pair.
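Roughly like this, a minimal sketch using the built-in levenshtein function on the columns returned by match_names (the output column name is just an example):

    from pyspark.sql.functions import levenshtein, col

    # add the edit distance between the two matched names to each pair
    matched_with_lev = matched_df.withColumn(
        "levenshtein", levenshtein(col("name"), col("name2"))
    )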
It works for a hundred rows to compare, but for hundreds of thousands it takes far too long, and I really need to make it faster. I think it can be parallelized, but I don't know how to do it.
Thanks in advance!