I usually join tables in PySpark like this:
joined_data = users_data.join(
    shopping_list,
    users_data["name"] == shopping_list["users_nm"],
    "left_outer",
)
joined_data.show(10)
name and users_nm are both strings, but they may contain extra, irrelevant text, so an exact-match join does not give me good results. For example:
- name = Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.
- users_nm = {""description"": ""Milrinone may increase the anticoagulant activities of Lepirudin.""
These two values are very different strings, but they refer to the same thing: Lepirudin.
I am thinking of using Levenshtein distance ...
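To make the idea concrete, here is a minimal pure-Python sketch of the kind of comparison I have in mind. The function names and the max_distance threshold are my own placeholders, not part of my real code. In PySpark I assume I would use the built-in pyspark.sql.functions.levenshtein in the join condition instead of a hand-written function.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(name: str, users_nm: str, max_distance: int = 3) -> bool:
    # max_distance is a hypothetical threshold that would need tuning.
    return levenshtein(name.lower(), users_nm.lower()) <= max_distance
```

My worry is that with long descriptions like the two above, the distance would be dominated by the surrounding text rather than by the drug name itself.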
Any suggestions?
Thank you