I'm thinking about the best strategy to approach the following problem, and I would like to hear your ideas about it.

I have two tables with the columns (ID_A, TEXT_A) and (ID_B, TEXT_B), and I have to evaluate, using an NLP model, the text similarity for each pair (ID_A, ID_B).

Naturally, this kind of problem leads to a huge number of pairs, since it is a cross join. I therefore discarded the idea of materializing a table with all the information required for the computation (ID_A, ID_B, TEXT_A, TEXT_B), and instead used two broadcast dictionaries {ID: TEXT}, which my UDF consults to fetch the texts for the pair under examination.
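For reference, a minimal sketch of what I mean (PySpark; `table_a` and `table_b` are the two DataFrames, and `similarity` is a trivial stand-in for the actual NLP model):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

def similarity(text_a, text_b):
    # Stand-in for the real NLP model scoring a pair of texts.
    return float(len(set(text_a.split()) & set(text_b.split())))

# Collect both tables to the driver and broadcast them as {ID: TEXT} maps.
dict_a = dict(table_a.select("ID_A", "TEXT_A").rdd.map(tuple).collect())
dict_b = dict(table_b.select("ID_B", "TEXT_B").rdd.map(tuple).collect())
bc_a = spark.sparkContext.broadcast(dict_a)
bc_b = spark.sparkContext.broadcast(dict_b)

@udf(DoubleType())
def pair_score(id_a, id_b):
    # Look up the texts on the executor instead of shipping them per row.
    return similarity(bc_a.value[id_a], bc_b.value[id_b])

# Cross join only the IDs; the texts travel once via the broadcasts.
pairs = table_a.select("ID_A").crossJoin(table_b.select("ID_B"))
scores = pairs.withColumn("SCORE", pair_score(col("ID_A"), col("ID_B")))
```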

However, as both tables keep growing over time, this solution doesn't scale well in my opinion.

The alternative is to solve the problem iteratively, processing one chunk of a table at a time and progressively appending the results to the output storage, as in the sketch below.
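Something like this, reusing `pair_score` from the sketch above (the chunk size and output path are placeholders):

```python
# Score one slice of table A at a time against all of table B, appending
# each batch of results to the output storage.
chunk_size = 10_000
ids_a = [row["ID_A"] for row in table_a.select("ID_A").collect()]

for start in range(0, len(ids_a), chunk_size):
    chunk = ids_a[start:start + chunk_size]
    chunk_df = table_a.filter(col("ID_A").isin(chunk))
    pairs = chunk_df.select("ID_A").crossJoin(table_b.select("ID_B"))
    result = pairs.withColumn("SCORE", pair_score(col("ID_A"), col("ID_B")))
    result.write.mode("append").parquet("/tmp/pair_scores")
```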

Any alternative idea?

Thanks!

Luca
  • Possible duplicate of [Efficient string matching in Apache Spark](https://stackoverflow.com/questions/43938672/efficient-string-matching-in-apache-spark) – 10465355 Nov 28 '18 at 16:13
  • I have to compute exact Scores and to store all results (not just top N). – Luca Nov 28 '18 at 16:19
  • If you need both exact and all pairs then Cartesian product is the only reasonable implementation. There is no hope for improving on that with such requirements. – 10465355 Nov 28 '18 at 23:13

0 Answers