I'm trying to (INNER) join two tables in Hive using RLIKE.
select a.col_x, b.col_y, count(*) as n
from tableA a
join tableB b
ON a.col_x RLIKE concat('^', b.col_z)
group by a.col_x, b.col_y
(tableA apprx. 100M records, tableB apprx. 1k records)
This query won't work since Hive only supports equality join.
I move inequalities to the where
clause (Ref: Error in Hive Query while joining tables).
select a.col_x, b.col_y, count(*) as n
from tableA a , tableB b
WHERE a.col_x RLIKE concat('^', b.col_z)
group by a.col_x, b.col_y
Second approach doesn't return any error. However, it run very very slow. Run approx 1 hour but still unable to get the output (normally, when I use equality join ON a.col_x RLIKE b.col_z
it only need less than 5 minutes).
Any solution (using Hive or other hadoop applications) is appreciated.