Slightly different greatest-n-per-group

Question

I have read this comment which explains the greatest-n-per-group problem and its solution. Unfortunately, I am facing a slightly different approach, and I am failing to find a solution for it.

Let's suppose I have a table with some basic info regarding users. Due to implementation, this info may or may not repeat itself:

+----+-------------------+----------------+---------------+
| id |     user_name     | user_name_hash |    address    |
+----+-------------------+----------------+---------------+
|  1 | peter_jhones      | 0xFF321345     | Some Av       |
|  2 | sally_whiterspoon | 0x98AB5454     | Certain St    |
|  3 | mark_jackobson    | 0x0102AB32     | Some Av       |
|  4 | mark_jackobson    | 0x0102AB32     | Particular St |
+----+-------------------+----------------+---------------+

As you can see, mark_jackobson appears twice, although its address is different in each appearance.

Every now and then, an ETL process queries new user_names and fetches the most recent records of each. Aftewards, it stores the user_name_hash in a table to sign it has already imported that certain user_name

+----------------+
| user_name_hash |
+----------------+
| 0xFF321345     |
| 0x98AB5454     |
+----------------+

Everything begins with the following query:

SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table

This way, I am able to select the new hashes from my table. Since I need to query the most recent occurrence of a hash, I wrap it as a sub-query:

SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
  SELECT DISTINCT user_name_hash
  FROM my_table
  EXCEPT
  SELECT user_name_hash
  FROM my_hash_table)
GROUP BY user_name_hash

Perfect! With the ids of my new users, I can query the addresses as follows:

SELECT
  address,
  user_name_hash
FROM my_table
WHERE Id IN (
  SELECT MAX(id)
  FROM my_table
  WHERE user_name_hash IN (
    SELECT DISTINCT user_name_hash
    FROM my_table
    EXCEPT
    SELECT user_name_hash
    FROM my_hash_table)
  GROUP BY user_name_hash)

From my perspective, the above query works, but it does not seem optimal. Reading this comment, I noticed I could query the same data, using joins. Since I am failing to write the desired query, could anyone help me out and point me to a direction?

This is the query I have attempted, without success.

SELECT
  tb1.address,
  tb1.user_name_hash
FROM my_table tb1
  INNER JOIN my_table tb2
    ON tb1.user_name_hash = tb2.user_name_hash
  LEFT JOIN my_hash_table ht
    ON tb1.user_name_hash = ht.user_name_hash AND tb1.id > tb2.id
WHERE ht.user_name_hash IS NULL;

Thanks in advance.

EDIT > I am working with PostgreSQL

Which [DBMS](https://en.wikipedia.org/wiki/DBMS) product are you using? Postgres? Oracle? "SQL" is just a query language, not the name of a specific database product. — , Dec 19 '17 at 16:20
@a_horse_with_no_name Postgres. Sorry for the lack of information — Vitor Durante, Dec 19 '17 at 16:39

score 1 · Accepted Answer · answered Dec 19 '17 at 16:33

1

I believe you are looking for something like this:

SELECT
  address,
  user_name_hash
FROM my_table t1
JOIN (
  SELECT MAX(id) maxid
  FROM my_table t2
  WHERE NOT EXISTS (
    SELECT 1
    FROM my_hash_table t3
    WHERE t2.user_name_hash = t3.user_name_hash
  )
  GROUP BY user_name_hash
) t ON t1.ID = t.maxid

I'm using NOT EXISTS instead of EXCEPT since it is more clear to the optimizer.

answered Dec 19 '17 at 16:33

Radim Bača

10,646
1
19
33

Great! Thank you! This decreased the Total Cost estimation from 4kk to 1.8kk! – Vitor Durante Dec 19 '17 at 17:15

score 0 · Answer 2 · answered Dec 19 '17 at 18:12

You can get a better performance using a left outer join (to get the newest records not already imported) and then compute the max id for these records (subquery in the HAVING clause).

SELECT  t1.address,
        t1.user_name_hash,
        MAX(id) AS maxid
FROM    my_table t1
LEFT JOIN my_hash_table th ON t1.user_name_hash = th.user_name_hash
WHERE th.user_name_hash IS NULL
GROUP BY t1.address,
         t1.user_name_hash
HAVING   MAX(id) = (SELECT MAX(id)
                    FROM my_table t1)

Slightly different greatest-n-per-group

2 Answers2