3

I want to perform a join based on Levenshtein distance.

I have 2 tables:

  1. Data: Which is a CSV in HDFS file repository. one of the columns is Disease description, 15K rows.
  2. df7_ct_map: a table I call from Hive. one of the columns is Disease Indication, 20K rows.

I'm trying to join both tables by matching each description with the indication (they are text descriptions of sicknesses). Ideally they need to be the same, but if both texts are different I wish to select matching text containing the maximum number of common words.

from pyspark.sql.functions import levenshtein  
joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), 
Data("Indication")) < 3)
joinedDF.show(10)

The problem is Data is a DataFrame which is why I obtain the following error:

TypeError: 'DataFrame' object is not callable
TypeError                                 Traceback (most recent call last)
in engine
----> 1 joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)

TypeError: 'DataFrame' object is not callable

Some advice? Can I use Fuzzywuzzy package? If so, how?

skrx
  • 19,980
  • 5
  • 34
  • 48
Lizou
  • 863
  • 1
  • 11
  • 16
  • 1
    Should be `df7_ct_map["description"]` not `df7_ct_map("description")`. Same for the other one: `Data["Indication"]` not `Data("Indication")`. I would also recommend https://stackoverflow.com/q/43938672/8371915 – Alper t. Turker Jan 24 '18 at 15:10
  • 1
    Thank you ! I will try replicate some best practices from your recommendation – Lizou Jan 24 '18 at 15:17

1 Answers1

12

Instead of joining using this the other option is as below

newDF=df1.join(df2,levenshtein(df1['description'], df2['description']) < 3)

This will allow a difference of at most 2 character while joining the 2 data frames.

might this is helpful.

Jalpesh Patel
  • 3,150
  • 10
  • 44
  • 68
Shveta Gupta
  • 136
  • 1
  • 2