
I have two datasets that share a key column (which is not unique) and each have a variable called 'compX' ('comp1' in table A, 'comp2' in table B). I want to left join table B onto table A on the shared column, matching each row of A with the row of B whose 'compX' value is closest, provided such a value exists within a predefined range.

(see image) In the following example, take the second row in table A: the shared value is 124. Two rows with the same shared value exist in table B, but only the row with comp2 == 8 is within the range of comp1 (comp1 - range <= comp2 <= comp1 + range, here 0 <= 8 <= 20), and therefore these two rows are matched.

For the fourth row in table A with shared = 125, we match it with comp2 = 91, since 91 is within range AND closer to 90 than 92 (the value of comp2 in row 5 of table B).

Is there a way to do this join efficiently using pyspark?
(image: example tables A and B with the shared key and comp1/comp2 columns)
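
For reference, here is a rough sketch of the kind of approach I have in mind: a left join on the shared key with a range condition, then a window to keep only the closest match per A row. The RANGE value, the toy data, and the helper column `_a_id` are assumptions for illustration, not the real data.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

RANGE = 10  # assumed tolerance; replace with the predefined range

# Illustrative toy data (assumed, based on the values mentioned above)
df_a = spark.createDataFrame([(124, 10), (125, 90)], ["shared", "comp1"])
df_b = spark.createDataFrame(
    [(124, 8), (124, 100), (125, 91), (125, 92)], ["shared", "comp2"]
)

# Tag each A row with a unique id so duplicate (shared, comp1) rows survive
df_a = df_a.withColumn("_a_id", F.monotonically_increasing_id())

# 1. Left join on the shared key, accepting only B rows whose comp2 lies in
#    [comp1 - RANGE, comp1 + RANGE]; A rows without a candidate keep NULL comp2
joined = df_a.join(
    df_b,
    (df_a["shared"] == df_b["shared"])
    & df_b["comp2"].between(df_a["comp1"] - RANGE, df_a["comp1"] + RANGE),
    how="left",
).select("_a_id", df_a["shared"], df_a["comp1"], df_b["comp2"])

# 2. For each A row, keep only the candidate with the smallest |comp1 - comp2|
w = Window.partitionBy("_a_id").orderBy(
    F.abs(F.col("comp1") - F.col("comp2")).asc_nulls_last()
)
result = (
    joined.withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)
    .drop("_rn", "_a_id")
)

result.show()
```

Since the join still contains an equality on `shared`, Spark should be able to plan it as an equi-join and apply the range condition as a filter, which I assume is cheaper than a pure non-equi join, but I don't know whether this is the most efficient way to do it.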

    welcome to stackoverflow! please review the guidelines in posting a reproducible example https://stackoverflow.com/help/minimal-reproducible-example – Naveed Oct 06 '22 at 21:32
  • [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/3404097) [Why are images of text, code and mathematical expressions discouraged?](https://meta.stackexchange.com/q/320052/266284) – philipxy Oct 07 '22 at 04:13
  • Does this answer your question? [Fetch the rows which have the Max value for a column for each distinct value of another column](https://stackoverflow.com/questions/121387/fetch-the-rows-which-have-the-max-value-for-a-column-for-each-distinct-value-of) – philipxy Oct 07 '22 at 04:14

0 Answers