Here's my problem.

I have two tables (10 million and 25 million rows). I want to compare the addresses in these two tables.

My solution was to create a UDF(adress1, adress2) that computes the Jaccard similarity:

String joinSql = "SELECT "
                    + "a.name, a.firstame, Jaccard(a.adress1, b.adress2) AS jaccard "
                    + "FROM tmp_tableA AS a, tmp_tableB AS b "
                    + "WHERE Jaccard(a.adress1, b.adress2) > 0.8";


System.out.println(joinSql);
Dataset<Row> dfr = spark.sql(joinSql);

It works but it takes ages. How can I optimize this?

Alper t. Turker
Jean

1 Answer

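Spark ML's MinHashLSH provides approxSimilarityJoin, which approximates a join on Jaccard distance without evaluating the UDF on every pair of rows. Note that it filters on distance (1 - similarity), so a similarity above 0.8 corresponds to a distance below 0.2.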

You can check:

for details.

My answer to Efficient string matching in Apache Spark shows how you can prepare data.
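
To illustrate why MinHash can stand in for an exact Jaccard UDF, here is a minimal plain-Java sketch with no Spark dependency. It is not how Spark implements MinHashLSH internally; the class name, the trigram tokenization, the hash mixing, and the 128-hash signature length are all assumptions chosen for the illustration. It only shows the idea: the fraction of matching positions in two MinHash signatures estimates the Jaccard similarity of the underlying sets.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class MinHashSketch {

    // Split a string into overlapping character n-grams ("shingles").
    static Set<String> shingles(String s, int n) {
        Set<String> out = new HashSet<>();
        String t = s.toLowerCase();
        for (int i = 0; i + n <= t.length(); i++) {
            out.add(t.substring(i, i + n));
        }
        return out;
    }

    // Exact Jaccard similarity: |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // MinHash signature: for each hash function, keep the smallest hash
    // value seen over all elements of the set.
    static long[] signature(Set<String> set, int numHashes, long seed) {
        Random rnd = new Random(seed); // same seed => same hash functions
        long[] sig = new long[numHashes];
        for (int k = 0; k < numHashes; k++) {
            long a = rnd.nextLong() | 1L; // odd multiplier
            long b = rnd.nextLong();
            long min = Long.MAX_VALUE;
            for (String s : set) {
                long h = a * s.hashCode() + b;
                // Extra bit mixing so the ordering looks more random.
                h ^= h >>> 33;
                h *= 0xff51afd7ed558ccdL;
                h ^= h >>> 33;
                if (h < min) {
                    min = h;
                }
            }
            sig[k] = min;
        }
        return sig;
    }

    // Estimated Jaccard similarity: fraction of signature positions that match.
    static double estimate(long[] x, long[] y) {
        int same = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == y[i]) {
                same++;
            }
        }
        return (double) same / x.length;
    }

    public static void main(String[] args) {
        Set<String> a = shingles("12 rue de la Paix 75002 Paris", 3);
        Set<String> b = shingles("12 rue de la Paix 75002 Pariss", 3);
        System.out.println("exact    = " + jaccard(a, b));
        System.out.println("estimate = "
                + estimate(signature(a, 128, 42L), signature(b, 128, 42L)));
    }
}
```

The signatures have a fixed length regardless of the address length, and rows can be bucketed by (parts of) their signature, which is what lets an LSH join skip most of the 10M x 25M candidate pairs. The estimate is approximate, so if exact values matter you can still recompute the exact Jaccard on the much smaller set of candidate pairs.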

Alper t. Turker