I'm trying to work out the best strategy for the following problem and would like your thoughts on it.
I have two tables with the columns (ID_A, TEXT_A) and (ID_B, TEXT_B), and I have to evaluate, using an NLP model, the text similarity for every (ID_A, ID_B) pair.
Naturally, this kind of problem leads to a huge number of pairs, since it is a cross join. Thus, I discarded the idea of materializing a table with all the information required for the computation (ID_A, ID_B, TEXT_A, TEXT_B), and instead tried two broadcast dictionaries {ID: TEXT}, which are looked up inside my UDF to fetch the texts for each examined pair.
However, since both tables keep growing over time, I don't think this solution scales very well.
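To make the broadcast-dictionary idea concrete, here is a minimal sketch in plain Python (in Spark, `text_a` and `text_b` would be `sc.broadcast({id: text})` values and `similarity` would be wrapped in a UDF applied to a table of ID pairs; the Jaccard similarity here is just a stand-in for the actual NLP model):

```python
def similarity(a: str, b: str) -> float:
    """Placeholder similarity: Jaccard overlap of word sets (stand-in for the NLP model)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Broadcast-style lookup dictionaries: {ID: TEXT}
text_a = {1: "red apple", 2: "green pear"}
text_b = {10: "red apple pie", 20: "blue car"}

# Score every (ID_A, ID_B) pair by looking the texts up in the dictionaries,
# so the pair table itself never has to carry the text columns.
scores = {(ia, ib): similarity(text_a[ia], text_b[ib])
          for ia in text_a for ib in text_b}
```

The point of the dictionaries is exactly what's described above: the cross join only needs to enumerate ID pairs, while the (broadcast) texts are fetched at scoring time.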
The alternative is to solve the problem iteratively, processing one chunk of the table at a time and progressively appending the results to the output storage.
Any alternative idea?
Thanks!