In my database, I have a DataFrame of hundreds of thousands of companies, and I need to match them against another DataFrame that contains all existing companies.
To do that, I use PySpark:
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH
    from pyspark.sql.functions import col

    def match_names(df_1, df_2):
        pipeline = Pipeline(stages=[
            # pattern="" splits each name into individual characters
            RegexTokenizer(
                pattern="", inputCol="name", outputCol="tokens", minTokenLength=1
            ),
            # build character 3-grams from the tokens
            NGram(n=3, inputCol="tokens", outputCol="ngrams"),
            # hash the n-grams into sparse feature vectors
            HashingTF(inputCol="ngrams", outputCol="vectors"),
            MinHashLSH(inputCol="vectors", outputCol="lsh")
        ])
        model = pipeline.fit(df_1)
        stored_hashed = model.transform(df_1)
        landed_hashed = model.transform(df_2)
        landed_hashed = landed_hashed.withColumnRenamed('name', 'name2')
        # join all pairs whose Jaccard distance is below the threshold (1.0 here)
        matched_df = model.stages[-1].approxSimilarityJoin(
            stored_hashed, landed_hashed, 1, "confidence"
        ).select(col("datasetA.name"), col("datasetB.name2"), col("confidence"))
        return matched_df
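I call it roughly like this (the DataFrame variable names below are just placeholders for my two tables):

    # df of my stored companies vs. df of all existing companies
    matched_df = match_names(stored_companies_df, all_companies_df)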
Then I also calculate the Levenshtein distance for each matched pair.
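Roughly like this, a minimal sketch using the built-in levenshtein function on the columns returned by match_names (the output column name is just an example):

    from pyspark.sql.functions import levenshtein, col

    # add the edit distance between the two matched names to each pair
    matched_with_lev = matched_df.withColumn(
        "levenshtein", levenshtein(col("name"), col("name2"))
    )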
It works for a hundred rows to compare, but for hundreds of thousands it takes far too long, and I really need to make it faster. I think it can be parallelized, but I don't know how to do it.
Thanks in advance!