I want to perform a join based on Levenshtein distance.
I have 2 tables:

- Data: a CSV file in the HDFS file repository. One of its columns is Diseasedescription; 15K rows.
- df7_ct_map: a table I call from Hive. One of its columns is DiseaseIndication; 20K rows.
I'm trying to join both tables by matching each description with the indication (both are text descriptions of sicknesses). Ideally they would be identical, but when the texts differ I want to select the match that shares the maximum number of common words.
from pyspark.sql.functions import levenshtein
joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"),
Data("Indication")) < 3)
joinedDF.show(10)
The problem is that Data is a DataFrame, which is why I obtain the following error:

TypeError: 'DataFrame' object is not callable
TypeError Traceback (most recent call last)
in engine
----> 1 joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)
TypeError: 'DataFrame' object is not callable
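From what I understand, PySpark columns are referenced with bracket notation (or col()) rather than the Scala-style df("col") call, so I assume the join itself should be written more like the sketch below (the column names description and Indication are just placeholders for my real columns, and I have not verified this fixes the error):

from pyspark.sql.functions import levenshtein

# Use bracket notation instead of calling the DataFrame like a function
joinedDF = df7_ct_map.join(
    Data,
    levenshtein(df7_ct_map["description"], Data["Indication"]) < 3
)
joinedDF.show(10)

Even if that works, levenshtein(...) < 3 only catches near-identical strings, not the "maximum number of common words" matching I described above.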
Any advice? Can I use the Fuzzywuzzy package? If so, how?
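If Fuzzywuzzy is the right tool, what I had in mind is something like the sketch below: wrap one of its scorers in a UDF, cross join the two tables, and keep the best-scoring indication per description. The choice of fuzz.token_set_ratio and the full 15K x 20K cross join are just my assumptions; I have not tested this and I am not sure it is reasonable at that size.

from fuzzywuzzy import fuzz
from pyspark.sql.functions import udf, col, row_number
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

# Score how similar two texts are on a 0-100 scale (token_set_ratio ignores word order)
fuzzy_score = udf(lambda a, b: fuzz.token_set_ratio(a, b), IntegerType())

# Compare every description with every indication (expensive: 15K x 20K rows)
scored = (Data.crossJoin(df7_ct_map)
              .withColumn("score",
                          fuzzy_score(col("Diseasedescription"), col("DiseaseIndication"))))

# Keep only the best-scoring indication for each description
best_first = Window.partitionBy("Diseasedescription").orderBy(col("score").desc())
matches = (scored.withColumn("rank", row_number().over(best_first))
                 .filter(col("rank") == 1)
                 .drop("rank"))
matches.show(10)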