
I have two tables like this:

Table A: ind , value1

(1,a)(2,b)(3,c) ... 3 billion rows

Table B: start_ind,end_ind,value2

(3,20,c)(78,99,y)(88,156,z) ... 500 million rows

And I run this query in Spark SQL:

SELECT /*+ BROADCAST(B) */ A.ind, A.value1, B.value2
FROM A JOIN B ON A.ind BETWEEN B.start_ind AND B.end_ind;

I give each executor 5 cores and 50 GB of memory, with 700 executors.

The query never finishes.

How can I improve the performance?

CompEng

1 Answer


Spark broadcast joins cannot be used when joining two large DataFrames, and that is the case here: at 500 million rows, table B is far too large to broadcast, so the `BROADCAST(B)` hint either gets ignored or forces the driver/executors to choke on the broadcast.

I recommend following Sim's suggestions; they may be useful.
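The underlying problem is that `BETWEEN` is a non-equi (range) join, which Spark can only execute as a broadcast nested-loop or cartesian-style join. One common workaround is bucketing: pick a bucket width, assign each point in A to a bucket, explode each range in B into every bucket it overlaps, then do a normal equi-join on the bucket key and re-apply the `BETWEEN` filter. Below is a plain-Python sketch of that idea (the bucket width `W` and the tiny sample data are illustrative, not from the question); in Spark you would express the same thing with a generated bucket column on A, an exploded bucket column on B, and a join on it.

```python
# Sketch of the "bucketing" trick for range joins, in plain Python.
# Idea: a point ind falls in bucket ind // W, and a range [start, end]
# is exploded into every bucket it overlaps. The join then becomes an
# equi-join on the bucket key, followed by the original BETWEEN filter.
# In Spark this replaces the nested-loop range join with a hash join.

W = 10  # bucket width; tune it to the typical range length in B

table_a = [(1, "a"), (2, "b"), (3, "c"), (15, "d"), (90, "e")]
table_b = [(3, 20, "c"), (78, 99, "y"), (88, 156, "z")]

# Explode each range of B into the buckets it overlaps.
b_by_bucket = {}
for start, end, value2 in table_b:
    for bucket in range(start // W, end // W + 1):
        b_by_bucket.setdefault(bucket, []).append((start, end, value2))

# Equi-join on the bucket key, then apply the BETWEEN predicate.
result = []
for ind, value1 in table_a:
    for start, end, value2 in b_by_bucket.get(ind // W, []):
        if start <= ind <= end:
            result.append((ind, value1, value2))

print(sorted(result))
# → [(3, 'c', 'c'), (15, 'd', 'c'), (90, 'e', 'y'), (90, 'e', 'z')]
```

The trade-off is duplication: each row of B is replicated once per bucket it spans, so a bucket width close to the typical range length keeps the blow-up small while still letting Spark shuffle-hash-join on the bucket key.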

shalnarkftw