12

I have two tables with a single key column. Keys in table a are subset of all keys in table b. I need to select keys from table b that are NOT in table a.

Here is a citation from Hive manual: "LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc."

They use this example for illustration:

    SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM B);

Is equivalent to

    SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key);

However, what I need to do is first example with 'NOT IN;. Unfortunately this syntax is not supported in Hive 0.13. It's for illustration only:

    SELECT a.key, a.value FROM a WHERE a.key NOT IN (SELECT b.key FROM B);

I searched this site for recommendations, and saw this example:

    SELECT a.key FROM a LEFT OUTER JOIN b ON a.key = b.key WHERE b.key IS NULL;

It does not work as expected. When I join a.key NOT in b and a.key IN b, I don't get the original a this way. Maybe that is because this query cannot do the trick, note bold text - b.key should not appear in WHERE.

What should I do then? Any other trick? Thanks!

P.S. I cannot share any real data; it's a pretty simple example, where keys in a are all included in b and a is a subset of b.

Kim Moritz
  • 165
  • 1
  • 11
mel
  • 1,566
  • 5
  • 17
  • 29
  • In all databases that I know of, the `not in` and `left join` with `is null` are functionally equivalent, assuming the comparison values are not `NULL`. – Gordon Linoff Jul 30 '14 at 15:47
  • 1
    Thanks, but I don't see how that can help. What is the solution query? – mel Jul 30 '14 at 16:11
  • . . Your last query should be functionally equivalent to the `not in` version (assuming no `NULL` values). Perhaps you can edit your question and explain how it is not doing what you expect. I understand how the queries work. I don't understand what you expect. – Gordon Linoff Jul 30 '14 at 16:13
  • I explained how its not working on large data set, though on simple data it works correctly. The question was: "Is there any other way of doing what I need?" – mel Jul 30 '14 at 17:46
  • 1
    Here is an alternative solution. SELECT key FROM (SELECT a.key, b1.marker FROM (SELECT key, COUNT(2) AS marker FROM b GROUP BY key) b1 RIGHT OUTER JOIN a ON b1.key = a.key) t WHERE marker IS NULL; – mel Jul 30 '14 at 17:59
  • http://stackoverflow.com/questions/27854649/hive-use-not-exists-in-using-semi-join?lq=1 – Charudatta Feb 10 '15 at 13:30

4 Answers4

8

If you want results from table b, perhaps you can do the following instead?

  SELECT b.key FROM b LEFT OUTER JOIN a ON b.key = a.key WHERE a.key IS NULL;
okkar
  • 83
  • 1
  • 4
  • 2
    I want results from table a, not in b, so I cannot select anything from table b. Consider table b is a blacklist of IPs. I want to select everything from table a including IP, where not blacklisted. Hence, the join is on IP as a key column. – mel Apr 21 '15 at 14:34
  • @mel your question clearly says that you need to select keys from table b which are not in table a. OKK77's answer is potentially the right answer. Please accept! – nitinr708 Jul 17 '17 at 12:49
4

The answer to your issue should be

SELECT a.key FROM a LEFT OUTER JOIN b ON a.key = b.key WHERE b.key IS NULL;

This means, bring all the keys from a, irrespective of whether there is a match in b or not. The where cause will filter those records, which are not available in b.

Gyanendra Dwivedi
  • 5,511
  • 2
  • 27
  • 53
4

Or you can try

SELECT a.key FROM a LEFT ANTI JOIN b ON a.key = b.key
user3512680
  • 407
  • 4
  • 5
1

I tried left semi join for IN function in cdh 5.7.0 with spark 1.6 version.

The semi left join gives wrong results, which is not similar to IN function in sub queries.

j pavan kumar
  • 369
  • 1
  • 6
  • 14