Pyspark levenshtein Join error

Question

I want to perform a join based on Levenshtein distance.

I have 2 tables:

Data: a CSV file stored in HDFS. One of its columns is the disease description; it has 15K rows.
df7_ct_map: a table I read from Hive. One of its columns is the disease indication; it has 20K rows.

I'm trying to join both tables by matching each description with the indication (they are text descriptions of illnesses). Ideally the two texts are identical, but when they differ I want to select the match that shares the largest number of common words.

from pyspark.sql.functions import levenshtein  
joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), 
Data("Indication")) < 3)
joinedDF.show(10)

The problem is that Data is a DataFrame, which is why I get the following error:

TypeError: 'DataFrame' object is not callable
TypeError                                 Traceback (most recent call last)
in engine
----> 1 joinedDF = df7_ct_map.join( Data, levenshtein(df7_ct_map("description"), Data("Indication")) < 3)

TypeError: 'DataFrame' object is not callable

Any advice? Can I use the fuzzywuzzy package? If so, how?

Should be `df7_ct_map["description"]` not `df7_ct_map("description")`. Same for the other one: `Data["Indication"]` not `Data("Indication")`. I would also recommend https://stackoverflow.com/q/43938672/8371915 — Alper t. Turker, Jan 24 '18 at 15:10
Thank you! I will try to replicate some best practices from your recommendation – Lizou, Jan 24 '18 at 15:17

score 12 · Accepted Answer · edited Jan 30 '18 at 08:16

Instead of calling the DataFrame like a function, index its columns with square brackets in the join condition:

newDF=df1.join(df2,levenshtein(df1['description'], df2['description']) < 3)

This allows an edit distance of at most 2 characters between the two values when joining the two DataFrames.

Hope this helps.
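To make the `< 3` threshold concrete, here is a minimal plain-Python sketch of the same metric: the number of single-character insertions, deletions, or substitutions needed to turn one string into the other. The function names `levenshtein_distance` and `within_threshold` are illustrative only, and this is not Spark's own implementation.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (before the current character) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def within_threshold(a: str, b: str, max_distance: int = 2) -> bool:
    """Mirror of the join condition `levenshtein(...) < 3`."""
    return levenshtein_distance(a, b) <= max_distance
```

Note that the metric counts characters, not words, so two long descriptions that share most of their words but differ in a few of them will usually fall outside a threshold of 2.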

edited Jan 30 '18 at 08:16 by Jalpesh Patel
answered Jan 30 '18 at 04:52 by Shveta Gupta

Is there a way to include the levenshtein distance value in the resulting dataframe? – nee21 Jul 27 '21 at 17:29
