3

I'm trying to (INNER) join two tables in Hive using RLIKE.

select a.col_x, b.col_y, count(*) as n
from tableA a
join tableB b
ON a.col_x RLIKE concat('^', b.col_z)
group by a.col_x, b.col_y

(tableA apprx. 100M records, tableB apprx. 1k records)

This query won't work since Hive only supports equality join.

I move inequalities to the where clause (Ref: Error in Hive Query while joining tables).

select a.col_x, b.col_y, count(*) as n
from tableA a , tableB b
WHERE a.col_x RLIKE concat('^', b.col_z)
group by a.col_x, b.col_y

Second approach doesn't return any error. However, it run very very slow. Run approx 1 hour but still unable to get the output (normally, when I use equality join ON a.col_x RLIKE b.col_z it only need less than 5 minutes).

Any solution (using Hive or other hadoop applications) is appreciated.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
nurandi
  • 1,588
  • 1
  • 11
  • 20

2 Answers2

1

Yes, "Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job." I take that to mean both equi-joins and natural joins (like your second example).

I would look into doing a CROSS JOIN with the RLIKE condition in the WHERE clause. Depending on the size of your data, you might have to look at doing the query per partition.

In Pig, you can also do a cross operation and then filter by your condition, similar to what is done in this post.

Community
  • 1
  • 1
JessicaOwensby
  • 189
  • 1
  • 7
  • Can you repartition the table - choosing a different partition key or maybe using a composite partition key? Or another option would be to separate your queries per partition into separate threads using java or some other language. – JessicaOwensby Sep 27 '16 at 15:02
1

In this particular case you can easily replace your not equality condition with equality one:

 a.col_x RLIKE concat('^', b.col_z)

is the same as

substr(a.col_x,1,length(b.col_z))=b.col_z
leftjoin
  • 36,950
  • 8
  • 57
  • 116