Cassandra & Spark: how to avoid a Cartesian product (Jaccard)

Question

Here's my problem:

I have two tables (10 million and 25 million rows). I want to compare the addresses in these two tables.

My solution was to create a UDF, Jaccard(adress1, adress2), that computes the Jaccard similarity, and to use it in a join:

String joinSql = "SELECT "
                    + "a.name, a.firstame, Jaccard(a.adress1, b.adress2) as jaccard "
                    + "FROM tmp_tableA as a, tmp_tableB as b "
                    + "where Jaccard(a.adress1, b.adress2) > 0.8";


System.out.println(joinSql);
Dataset<Row> dfr = spark.sql(joinSql);

It works but it takes ages. How can I optimize this?

score 1 · Answer 1 · answered Aug 22 '17 at 21:31

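Spark ML's MinHashLSH provides approxSimilarityJoin, an approximate join on Jaccard distance that avoids comparing every pair of rows. Note that it takes a distance threshold, so a Jaccard similarity above 0.8 corresponds to a Jaccard distance below 0.2.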
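To illustrate why this avoids the Cartesian product, here is a minimal plain-Java sketch of MinHash signatures with LSH banding. It is not Spark's implementation: the class name MinHashSketch, the whitespace tokenization, and the parameters (128 hash functions, 32 bands of 4 rows) are all assumptions chosen for illustration. Rows are only compared when they share at least one band bucket, and the exact Jaccard similarity is computed for those candidates alone.

```java
import java.util.*;

// Plain-Java illustration of MinHash + LSH banding (hypothetical names, not Spark's code).
class MinHashSketch {
    static final int NUM_HASHES = 128;            // signature length (assumed value)
    static final int BANDS = 32;                  // 32 bands of 4 rows each (assumed)
    static final int ROWS = NUM_HASHES / BANDS;
    static final long PRIME = 2147483647L;        // 2^31 - 1
    static final long[] A = new long[NUM_HASHES];
    static final long[] B = new long[NUM_HASHES];

    static {
        Random rnd = new Random(42);              // fixed seed keeps signatures reproducible
        for (int i = 0; i < NUM_HASHES; i++) {
            A[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            B[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // Lower-cased whitespace tokens; a real pipeline may prefer character n-grams.
    static Set<String> tokens(String address) {
        return new HashSet<>(Arrays.asList(address.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    // Exact Jaccard similarity |a ∩ b| / |a ∪ b|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / (a.size() + b.size() - inter.size());
    }

    // For each hash function, keep the minimum hash value over all tokens.
    static long[] signature(Set<String> tokens) {
        long[] sig = new long[NUM_HASHES];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String t : tokens) {
            long x = t.hashCode() & 0x7fffffffL;
            for (int i = 0; i < NUM_HASHES; i++) {
                long h = (A[i] * x + B[i]) % PRIME;
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // Fraction of matching signature positions estimates the Jaccard similarity.
    static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < NUM_HASHES; i++) if (s1[i] == s2[i]) same++;
        return (double) same / NUM_HASHES;
    }

    // One bucket key per band; rows sharing any key become candidate pairs.
    static List<String> bandKeys(long[] sig) {
        List<String> keys = new ArrayList<>();
        for (int b = 0; b < BANDS; b++) {
            long[] slice = Arrays.copyOfRange(sig, b * ROWS, (b + 1) * ROWS);
            keys.add(b + ":" + Arrays.hashCode(slice));
        }
        return keys;
    }

    // Returns [leftIndex, rightIndex] pairs whose exact Jaccard similarity exceeds minSim.
    static Set<List<Integer>> similarPairs(List<String> left, List<String> right, double minSim) {
        Map<String, List<Integer>> buckets = new HashMap<>();
        List<Set<String>> rightTokens = new ArrayList<>();
        for (int j = 0; j < right.size(); j++) {
            Set<String> t = tokens(right.get(j));
            rightTokens.add(t);
            for (String key : bandKeys(signature(t)))
                buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(j);
        }
        Set<List<Integer>> result = new HashSet<>();
        for (int i = 0; i < left.size(); i++) {
            Set<String> t = tokens(left.get(i));
            for (String key : bandKeys(signature(t)))
                for (int j : buckets.getOrDefault(key, Collections.emptyList()))
                    if (jaccard(t, rightTokens.get(j)) > minSim)   // verify candidates only
                        result.add(Arrays.asList(i, j));
        }
        return result;
    }
}
```

Calling MinHashSketch.similarPairs(tableA, tableB, 0.8) on two lists of address strings returns the index pairs that pass the exact check. Because the banding step is probabilistic, some true matches near the threshold can be missed; more bands with fewer rows each raise recall at the cost of more candidates.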

You can check:

for details.

My answer to Efficient string matching in Apache Spark shows how you can prepare data.
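The linked answer is not reproduced here. As a rough idea of what data preparation can look like, the sketch below normalizes an address and splits it into character 3-grams, so that small spelling differences still leave most of the set intact before any Jaccard or MinHash step. The class name AddressShingles, the normalization rules, and the n-gram length are assumptions for illustration, not taken from that answer.

```java
import java.util.*;

// Hypothetical helper: turn an address string into a set of character n-grams.
class AddressShingles {
    // Lower-case, replace punctuation with spaces, collapse whitespace.
    static String normalize(String address) {
        return address.toLowerCase(Locale.ROOT)
                      .replaceAll("[^\\p{L}\\p{N}]+", " ")
                      .trim();
    }

    // Set of overlapping character n-grams of the normalized text.
    static Set<String> shingles(String address, int n) {
        String text = normalize(address);
        Set<String> out = new HashSet<>();
        if (text.length() <= n) {          // short input: keep it as a single shingle
            out.add(text);
            return out;
        }
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    // Exact Jaccard similarity between two shingle sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / (a.size() + b.size() - inter.size());
    }
}
```

With this preparation, "12, Main St." and "12 main st" produce identical shingle sets, so formatting noise no longer lowers the similarity score.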

Alper t. Turker
