Spark SQL LIKE join performance

Asked Oct 28 '15 at 20:20

Active Oct 28 '15 at 20:20

Viewed 1,421 times

Does anyone have experience with LIKE joins in Spark SQL? I'm on Spark 1.5.0.

sqlContext.sql("SELECT COUNT(*) FROM data a JOIN tokens b WHERE a.text LIKE CONCAT('%', token, '%')")

versus something ugly like

sqlContext.sql("SELECT * FROM data a WHERE a.text LIKE '%token1%' UNION ALL SELECT * FROM data a WHERE a.text LIKE '%token2%' UNION ALL  ....")

or something similar that avoids joining the two tables with LIKE. The data table has tens of millions of rows with a text column of about 100 characters, and the tokens table has thousands of tokens (some of them containing %). The second approach runs much faster. The LIKE join takes ages, and its execution looks suspicious: the time tasks take to finish rises exponentially, whereas I'd expect every partition to take about the same time.
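For reference, here is a minimal sketch of how the UNION ALL query could be generated from the token list rather than written by hand. The helper name `build_union_all_query` and its `table`/`column` parameters are hypothetical, made up for illustration. It assumes the tokens are already collected into a Python list and contain no single quotes. A `%` inside a token is left as-is, so it still acts as a LIKE wildcard, as it would in the CONCAT version.

```python
def build_union_all_query(tokens, table="data", column="text"):
    """Build one SELECT ... LIKE '%token%' branch per token, joined by UNION ALL.

    Hypothetical helper: table and column default to the names used in the question.
    """
    if not tokens:
        raise ValueError("need at least one token")

    branches = []
    for token in tokens:
        # Assumption: quote escaping is out of scope for this sketch,
        # so tokens containing a single quote are rejected outright.
        if "'" in token:
            raise ValueError("token contains a single quote: %r" % token)
        branches.append(
            "SELECT * FROM {t} a WHERE a.{c} LIKE '%{tok}%'".format(
                t=table, c=column, tok=token
            )
        )
    return " UNION ALL ".join(branches)


query = build_union_all_query(["token1", "token2"])
# With a live SQLContext the string would then be passed on unchanged:
# result = sqlContext.sql(query)
```

Like the hand-written version, this returns a row once per matching token, so a row matching several tokens appears several times.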

Thanks

devopslife

See my answer here: http://stackoverflow.com/q/33168970/1560062 – zero323 Oct 28 '15 at 20:27

0 Answers