Here's my problem.
I have two tables (10 million and 25 million rows), and I want to compare the addresses in these two tables.
My solution was to create a UDF, Jaccard(adress1, adress2), that computes the Jaccard similarity, and to use it in a Spark SQL query:
String joinSql = "SELECT "
+ "a.name, a.firstame, Jaccard(a.adress1,b.adress2) as jaccard "
+ "FROM tmp_tableA as a, tmp_tableB as b "
+ "where (Jaccard(a.adress1,b.adress2) > 0.8)";
System.out.println(joinSql);
Dataset<Row> dfr = spark.sql(joinSql);
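For reference, the similarity function behind the UDF is along these lines. This is only a simplified sketch: it assumes Jaccard is computed over lowercase, whitespace-separated tokens, and the class and method names (JaccardSketch, tokens, jaccard) are illustrative rather than the exact code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: names and tokenization are assumptions.
final class JaccardSketch {

    // Splits an address into a set of lowercase, whitespace-separated tokens.
    static Set<String> tokens(String s) {
        Set<String> result = new HashSet<>();
        if (s == null) {
            return result;
        }
        for (String t : s.trim().toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard similarity = |A intersect B| / |A union B|.
    // Defined here as 0.0 when both token sets are empty.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        int unionSize = ta.size() + tb.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

A function like this can be exposed to SQL with spark.udf().register("Jaccard", (UDF2<String, String, Double>) JaccardSketch::jaccard, DataTypes.DoubleType), which is what makes the Jaccard(...) call in the query resolvable.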
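It works, but it takes ages, presumably because the join has no equality condition, so the UDF is evaluated for every pair of rows from the two tables. How can I optimize this?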