
I'm used to joining tables in PySpark using the following:

joined_data = users_data.join(shopping_list,
                              users_data["name"] == shopping_list["users_nm"],
                              "left_outer")
joined_data.show(10)

The name and users_nm columns are strings, but they may contain extra, unnecessary text, so this join is not giving me good results. As an example:

  • name = Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.
  • users_nm = {"description": "Milrinone may increase the anticoagulant activities of Lepirudin."

These two values are very different, but they refer to the same thing: Lepirudin.

I am thinking of using Levenshtein distance...
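Roughly what I have in mind, as a plain-Python sketch of the metric (in the actual join I would presumably use the built-in `levenshtein(left, right)` function from `pyspark.sql.functions` inside the join condition, with some distance threshold; the helper and the threshold idea below are just an illustration, not tested on my real data):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

# Close variants of the same drug name have a small distance:
print(levenshtein("Lepirudin", "Lepirudin"))   # 0
print(levenshtein("Lepirudin", "lepirudin"))   # 1
```

My worry is that on the full strings shown above the distance would be huge (they share only the token "Lepirudin" inside much longer, unrelated text), so a raw Levenshtein join on the whole columns may not match them either.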

Suggestions?

Thank you

Lizou
  • And possible duplicate of [How can we JOIN two Spark SQL dataframes using a SQL-esque “LIKE” criterion?](https://stackoverflow.com/q/33168970/8371915) – Alper t. Turker Feb 05 '18 at 13:19
  • Not a duplicate, since I don't have a keyword list. This is a different subject, because the extended name can come from either dataframe. – Lizou Feb 05 '18 at 13:49

0 Answers