PySpark Join for String data

Question

I usually join tables in PySpark as follows:

joined_data = users_data.join(
    shopping_list,
    users_data["name"] == shopping_list["users_nm"],
    "left_outer",
)
joined_data.show(10)

Both name and users_nm are strings, but they may contain extra, irrelevant text, so the exact-match join does not give good results. For example:

name = Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.
users_nm = {""description"": ""Milrinone may increase the anticoagulant activities of Lepirudin.""

These two values are very different, but they refer to the same thing: Lepirudin.

I am thinking of using Levenshtein distance...
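
To make the idea concrete, here is a minimal sketch of the metric in plain Python rather than Spark. The function name `levenshtein` and the shortened sample strings are assumptions made for illustration only. PySpark also provides `pyspark.sql.functions.levenshtein`, which could presumably be used inside the join condition together with a threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Shortened versions of the two values from the example (illustrative only).
name = "Lepirudin is identical to natural hirudin except for substitution of leucine"
users_nm = "Milrinone may increase the anticoagulant activities of Lepirudin."

print(levenshtein("kitten", "sitting"))       # 3
print(levenshtein("Lepirudin", "lepirudin"))  # 1 (comparison is case-sensitive)
print(levenshtein(name, users_nm))            # distance between the two descriptions
```

Since the metric counts character edits over the whole string, two long descriptions that share only one word will likely be far apart, so a threshold-based join may need the text reduced to a key term first.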

Any suggestions?

Thank you

Possible duplicate of [How can we JOIN two Spark SQL dataframes using a SQL-esque “LIKE” criterion?](https://stackoverflow.com/q/33168970/8371915) — Alper t. Turker, Feb 05 '18 at 13:19
Not a duplicate, since I don't have a keyword list. This is a different problem, because the extended name can come from either dataframe. — Lizou, Feb 05 '18 at 13:49

0 Answers