0

I have a table similar to the following.

|name  |   grp   |  dt    
------------------------------   
|foo   |    A    |  2016-01-01
|bar   |    A    |  2016-01-02
|hai   |    B    |  2016-01-01
|bai   |    B    |  2016-01-02
|baz   |    C    |  2016-01-01

For each group, I want to find the name whose dt is the most recent. In other words, MAX(dt), GROUP by grp, and associate the name whose dt is the max of the group to the output:

|name  |   grp   |  dt    
------------------------------
|bar   |   A     | 2016-01-02
|bai   |   B     | 2016-01-02
|baz   |   C     | 2016-01-01

In Oracle, the following query works and is very clean (taken from here):

SELECT o.name, o.grp, o.dt 
FROM tab o
     LEFT JOIN tab b
         ON o.grp = b.grp AND o.dt < b.dt
WHERE b.dt IS NULL

However this fails with [Error 10017]: Line 4:43 Both left and right aliases encountered in JOIN 'service_effective_from' From another question quoting the documentation, I learn that I cannot use an inequality operator in a join statement:

Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.

What is a clean solution for obtaining this in Hive, given that I cannot use an inequality operator in a join condition?

Community
  • 1
  • 1
Matthew Moisen
  • 16,701
  • 27
  • 128
  • 231
  • If you want to find the `name` whose `dt` is the most recent per group then why is `foo` returned? It's associated with `2016-01-01` which is not the most recent in group `A`. – o-90 Jun 02 '16 at 03:00
  • @GoBrewers14 My apologies; you are correct; I have edited it to be `bar` – Matthew Moisen Jun 02 '16 at 06:12

1 Answers1

0

The following works and is taken from here, but I don't find it very clean:

SELECT o.name, ogrp, o.dt
FROM tab o
    JOIN (
        SELECT grp, MAX(dt) dt
        FROM tab
        GROUP BY grp
    ) b
        ON o.grp = b.grp AND o.dt = b.dt

As an aside, it takes 164 seconds on my environment for a comparable test table with 4 rows.

Community
  • 1
  • 1
Matthew Moisen
  • 16,701
  • 27
  • 128
  • 231