
I have two tables like this:

Table A: ind , value1

(1,a)(2,b)(3,c) ... 3 billion rows

Table B: start_ind,end_ind,value2

(3,20,c)(78,99,y)(88,156,z) ... 500 million rows

And I run this query in Spark SQL:

SELECT /*+ BROADCAST(B) */ A.ind, A.value1, B.value2
FROM A JOIN B ON A.ind BETWEEN B.start_ind AND B.end_ind;

I give each executor 5 cores and 50 GB of memory, with 700 executors.

The query never finishes.

How can I improve the performance?

CompEng

1 Answer


Spark broadcast joins cannot be used when joining two large DataFrames, and that is the case here: at 500 million rows, table B is far too large to broadcast, so the `BROADCAST(B)` hint either gets ignored or forces the driver/executors to choke on the broadcast.

I recommend following Sim's suggestions; they may be useful.
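The underlying problem is that `BETWEEN` is a non-equi (range) join, which Spark can only execute as a broadcast nested-loop or cartesian-style join. One common workaround is bucketing: pick a bucket width, assign each point in A to a bucket, explode each range in B into every bucket it overlaps, then do a normal equi-join on the bucket key and re-apply the `BETWEEN` filter. Below is a plain-Python sketch of that idea (the bucket width `W` and the tiny sample data are illustrative, not from the question); in Spark you would express the same thing with a generated bucket column on A, an exploded bucket column on B, and a join on it.

```python
# Sketch of the "bucketing" trick for range joins, in plain Python.
# Idea: a point ind falls in bucket ind // W, and a range [start, end]
# is exploded into every bucket it overlaps. The join then becomes an
# equi-join on the bucket key, followed by the original BETWEEN filter.
# In Spark this replaces the nested-loop range join with a hash join.

W = 10  # bucket width; tune it to the typical range length in B

table_a = [(1, "a"), (2, "b"), (3, "c"), (15, "d"), (90, "e")]
table_b = [(3, 20, "c"), (78, 99, "y"), (88, 156, "z")]

# Explode each range of B into the buckets it overlaps.
b_by_bucket = {}
for start, end, value2 in table_b:
    for bucket in range(start // W, end // W + 1):
        b_by_bucket.setdefault(bucket, []).append((start, end, value2))

# Equi-join on the bucket key, then apply the BETWEEN predicate.
result = []
for ind, value1 in table_a:
    for start, end, value2 in b_by_bucket.get(ind // W, []):
        if start <= ind <= end:
            result.append((ind, value1, value2))

print(sorted(result))
# → [(3, 'c', 'c'), (15, 'd', 'c'), (90, 'e', 'y'), (90, 'e', 'z')]
```

The trade-off is duplication: each row of B is replicated once per bucket it spans, so a bucket width close to the typical range length keeps the blow-up small while still letting Spark shuffle-hash-join on the bucket key.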

shalnarkftw