0

I am learning Hive and wanted to write an optimized HiveQL/SQL query

My table looks like this:

CREATE TABLE sales (dealer VARCHAR(25), make VARCHAR(25), type VARCHAR(25), day INT);
INSERT INTO sales (dealer, make, type, day) VALUES
("Xyz", "Highlander", "SUV", "0"),
("Xyz", "Prius", "HATCH", "1"),
("Xyz", "Prius", "HATCH", "2"),
("Xyz", "Prius", "HATCH", "3"),
("Xyz", "Versa", "HATCH", "1"),
("Xyz", "Versa", "HATCH", "2"),
("Xyz", "Versa", "HATCH", "3"),
("Xyz", "S3", "SEDAN", "1"),
("Xyz", "S3", "SEDAN", "2"),
("Abc", "Forrester", "SUV", "1");

Given a "dealer" D, I want to compute the top N "make" for each "type" in the past X days, in a single query.

SELECT dealer, make, type, COUNT(*) AS frequency FROM sales
WHERE day > 0 AND dealer LIKE 'Xyz' GROUP BY make, type
ORDER BY frequency DESC LIMIT 5

The problem is when using GROUP BY on "make" and "type" for top 1, I will only get:

DEALER, MAKE, TYPE, COUNT
Xyz, Prius, Hatch, 3
Xyz, Versa, Hatch, 3
Xyz, S3, Sedan, 2
...

But I want

Xyz, Prius, Hatch, 3
Xyz, S3, Sedan, 2
...

for EACH "type" the top N.

Could someone help me understand how to write such a query?

SQL Fiddle http://sqlfiddle.com/#!2/df9304/5

****Update****

Seems like rank() would be useful:

Hive getting top n records in group by query

https://blogs.oracle.com/taylor22/entry/hive_0_11_may_15

HiveQL and rank()

Community
  • 1
  • 1
foobarometer
  • 751
  • 1
  • 9
  • 20
  • What query did you run to get the result you listed? – J Maurer Aug 23 '14 at 20:32
  • I did not run a query but my understanding of the docs: SELECT dealer, make, type, COUNT(*) AS frequency FROM table WHERE day > 0 AND dealer == 'Xyz' GROUP BY make, type ORDER BY frequency DESC LIMIT 5 – foobarometer Aug 23 '14 at 20:35
  • Say you want to compute the top 5 makes for each type...how is `xyz, versa, hatch, 3` not included in that list? – o-90 Aug 25 '14 at 01:12
  • 1
    @GoBrewers14, yeah you are right, I gave an example for top 1. Updated in question. See rank(), it seems that will be useful. – foobarometer Aug 25 '14 at 05:22

1 Answers1

0

After reading some more docs and the hints from the linked questions:

SELECT dealer, make, rank, type FROM (
    SELECT dealer, make, rank() OVER (PARTITION BY type ORDER BY count DESC) AS rank, type FROM (
        SELECT dealer, make, count(*) AS count, type FROM Sales WHERE dealer = "Xyz" GROUP BY dealer, type, make
    ) CountedSales
) RankedSales
WHERE RankedSales.rank < 3;

Inner query does counting, middle query performs rank() and the outer query limits on rank.

Sales table contents

hive> select * from Sales;
OK
Xyz      Highlander      SUV    NULL
Xyz      Highlander      SUV    NULL
Xyz      Rouge   SUV    NULL
Xyz      Rouge   SUV    NULL
Xyz      Prius   HATCH  NULL
Xyz      Prius   HATCH  NULL
Xyz      Prius   HATCH  NULL
Xyz      Versa   HATCH  NULL
Xyz      S3      SEDAN  NULL
Xyz      S3      SEDAN  NULL
Xyz      S3      SEDAN  NULL
Xyz      A8      SEDAN  NULL
Xyz      A8      SEDAN  NULL
Xyz      A8      SEDAN  NULL
Xyz      A8      SEDAN  NULL
Time taken: 0.054 seconds, Fetched: 15 row(s)

Now the actual query.

hive> SELECT dealer, make, rank, type FROM (                                                                          
    >     SELECT dealer, make, rank() OVER (PARTITION BY type ORDER BY count DESC) AS rank, type FROM (
    >         SELECT dealer, make, count(*) AS count, type FROM Sales WHERE dealer = "Xyz" GROUP BY dealer, type, make
    >     ) CountedSales
    > ) RankedSales
    > WHERE RankedSales.rank < 3;
...
Execution completed successfully
MapredLocal task succeeded
OK
Xyz      Prius  1        HATCH
Xyz      Versa  2        HATCH
Xyz      A8     1        SEDAN
Xyz      S3     2        SEDAN
Xyz      Rouge  1        SUV
Xyz      Highlander     1        SUV
Time taken: 28.491 seconds, Fetched: 6 row(s)
foobarometer
  • 751
  • 1
  • 9
  • 20