7

I have a table

Name  pets
--------------
Andy  {dog:2, cat:1, bird:4}
John  {tiger:3, elephant:1, fish:2}
Mary  {dog:2, pig:2}

I want the find the pet type with maximum count for each person. In the event of tie, duplicate the row for each pet. The result should look like this:

Name  max_pet
------------------
Andy  bird
John  tiger
Mary  dog
Mary  pig

Currently, I exported the table and do it in python. But I am wondering can I achieve this using Presto/SQL query? Thanks!

Manfred Moser
  • 29,539
  • 13
  • 92
  • 123
Edamame
  • 23,718
  • 73
  • 186
  • 320

3 Answers3

10

There are several ways to do this. One approach is to use UNNEST to convert the map into rows, with one row per map entry. You can then use the rank() window function to assign a ranking to the pets for each name, after which you select only the top ranked item(s).

WITH people (name, pets) AS (
  VALUES
    ('Andy', map_from_entries(array[('dog', 2), ('cat', 1), ('bird', 4)])),
    ('John', map_from_entries(array[('tiger', 3), ('elephant', 1), ('fish', 2)])),
    ('Mary', map_from_entries(array[('dog', 2), ('pig', 2)]))
)
SELECT name, pet AS max_pet
FROM (
    SELECT name, pet, count,
           rank() OVER (PARTITION BY name ORDER BY count DESC) rnk
    FROM people
    CROSS JOIN UNNEST(pets) AS t (pet, count)
)
WHERE rnk = 1;
 name | max_pet 
------+---------
 Andy | bird    
 John | tiger   
 Mary | dog     
 Mary | pig     
(4 rows)

Using UNNEST is simple to understand, but doesn't work well if you need to combine it with other operations, or if you have duplicate names.

Another approach is to convert the map into an array using map_entries(), use filter() to select the pet(s) with a count that equals the maximum count, then use transform() to only return the pet name. At this point, you have an array of the maximum pets. You can then UNNEST it into multiple rows, or keep it as an array for further processing. filter() and transform() utilize a lambda expression which is a Presto specific extension to SQL.

WITH people (name, pets) AS (
  VALUES
    ('Andy', map_from_entries(array[('dog', 2), ('cat', 1), ('bird', 4)])),
    ('John', map_from_entries(array[('tiger', 3), ('elephant', 1), ('fish', 2)])),
    ('Mary', map_from_entries(array[('dog', 2), ('pig', 2)]))
)
SELECT
    name,
    transform(
        filter(
            map_entries(pets),
            e -> e[2] = array_max(map_values(pets))),
        e -> e[1]) AS max_pets
FROM people;
 name |  max_pets  
------+------------
 Andy | [bird]     
 John | [tiger]    
 Mary | [dog, pig] 
(3 rows)
David Phillips
  • 10,723
  • 6
  • 41
  • 54
  • `transform` would return `ARRAY` type, we need to add `[1]` after `transform` method . – Xiao Sep 11 '20 at 03:10
  • @Xiao I added the example output to the answer. The usage of an array is intentional, because there can be multiple max pets (as the original question shows). – David Phillips Sep 11 '20 at 23:20
  • Simply run this `FILTER(MAP_KEYS(pets), x -> pets[x] = array_max(map_values(pets))) AS max_pets` – user2236600 Jan 04 '23 at 15:39
2

For Athena users, where accessing row fields by index is not available (true at least for the 2nd version of the engine, have not checked the 3rd) the second approach from answer can be rewritten with map_keys and map access by index:

-- sample data
WITH people (name, pets) AS (
  VALUES
    ('Andy', map_from_entries(array[('dog', 2), ('cat', 1), ('bird', 4)])),
    ('John', map_from_entries(array[('tiger', 3), ('elephant', 1), ('fish', 2)])),
    ('Mary', map_from_entries(array[('dog', 2), ('pig', 2)]))
)

-- query
SELECT
    name,
    transform(
        filter(
            map_keys(pets),
            e -> pets[e] = array_max(map_values(pets))), -- we now that e is present in pets
        e -> e) AS max_pets
FROM people;

Output:

name max_pets
Andy [bird]
John [tiger]
Mary [dog, pig]

Or just use map_filter:

SELECT
    name,
    map_keys(
        map_filter(
            pets,
            (k,v) -> v = array_max(map_values(pets)))) max_pets
FROM people;
Guru Stron
  • 102,774
  • 10
  • 95
  • 132
0

Use the map_top_n_keys function:

SELECT map_top_n_keys(map(ARRAY[‘a’, ‘b’, ‘c’], ARRAY[1, 2, 3]), 2) — [‘c’, ‘b’]

Ori N
  • 555
  • 10
  • 22