SQL group by selecting top rows with possible nulls

Question

The example table:

id	name	create_time	group_id
1	a	2022-01-01 12:00:00	group1
2	b	2022-01-01 13:00:00	group1
3	c	2022-01-01 12:00:00	NULL
4	d	2022-01-01 13:00:00	NULL
5	e	NULL	group2

I need to get top 1 rows (with the minimal create_time) grouped by group_id with these conditions:

create_time can be null - it should be treated as a minimal value
group_id can be null - all rows with nullable group_id should be returned (if it's not possible, we can use coalesce(group_id, id) or sth like that assuming that ids are unique and never collide with group ids)
it should be possible to apply pagination on the query (so join can be a problem)
the query should be universal as much as possible (so no vendor-specific things). Again, if it's not possible, it should work in MySQL 5&8, PostgreSQL 9+ and H2

The expected output for the example:

id	name	create_time	group_id
1	a	2022-01-01 12:00:00	group1
3	c	2022-01-01 12:00:00	NULL
4	d	2022-01-01 13:00:00	NULL
5	e	NULL	group2

I've already read similar questions on SO but 90% of answers are with specific keywords (numerous answers with PARTITION BY like https://stackoverflow.com/a/6841644/5572007) and others don't honor null values in the group condition columns and probably pagination (like https://stackoverflow.com/a/14346780/5572007).

Your question, to be a [mre], should contain enough data, and expected output. Not everyone is a native English speaker, and can understand what you do mean when saying: "1 rows with lesser create_time". that's why you should specify desired output, and also what did you try to write this query yourself? Where did you have problems? — Luuk, Jan 01 '22 at 14:54
What do you mean with top 1 rows? Only one row? Your example output has 4 rows. — Thomas Aichinger, Jan 01 '22 at 15:14
You say, if create_time is null it should be treated as a minimal value. In your example output it is treated as maximum. — Thomas Aichinger, Jan 01 '22 at 15:15
SQL is full of specific standard keywords which some databases do not support. Try picking one database/version as a target. MySQL 5 will likely be your limiting factor and is missing support for much standard SQL, such as window functions and common table expressions. — Jon Armstrong, Jan 01 '22 at 15:16
@ThomasAichinger The `group_id` column should be used for grouping, and we need to select the top row with the smallest `create_time` among each group. In the example, we should get 2 rows with `group_id = null`, one with `group_id = group1` and the smallest `create_time` (in this case it has `2022-01-01 12:00:00` but if there was a row with null time, we'd return it instead), and one with `group_id = group2` and the smallest `create_time` (but now, it's a single row anyway) — amseager, Jan 01 '22 at 15:23
You say in your question you've already seen answers that you (presumably) can't use, but you haven't tagged your RDBMS so we don't know what you *can* use. MySql 5 and 8 support very different syntax. — Stu, Jan 01 '22 at 15:38
@Stu I've already stated in 4. that the desired query should work in (perfectly) any database or (at least) in 4 specified ones. But the former is preferable so I decided not to add any RDBMS tags in order not to get the vendor-specific answers again. — amseager, Jan 01 '22 at 15:47
Ok. There's a difference with using vendor-specific syntax and standard ANSI SQL that is simply not supported in an older product. — Stu, Jan 01 '22 at 15:49
I removed the conflicting DBMS tags. Please add only one tag for the database product you are really using. — , Jan 01 '22 at 17:35
@a_horse_with_no_name I haven't added any RDBMS tag, it was done by someone else — amseager, Jan 05 '22 at 17:52
I don't think the accepted answer handles the "nulls first" requirement. What happened there? — shawnt00, Jan 10 '22 at 22:45

shawnt00 · Answer 1 · 2022-01-01T23:09:06.400

select * from T t1
where coalesce(create_time, 0) = (
    select min(coalesce(create_time, 0)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)

Not sure how you imagine "pagination" should work. Here's one way:

and (
    select count(distinct coalesce(t2.group_id, t2.id)) from T t2
    where coalesce(t2.group_id, t2.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5 /* for example */
order by coalesce(t1.group_id, t1.id)

I'm assuming there's an implicit cast from 0 to a date value with a resulting value lower than all those in your database. Not sure if that's reliable. (Try '19000101' instead?) Otherwise the rest should be universal. You could probably also parameterize that in the same way as the page range.

You've also got a potential a complication with potential collisions between the group_id and id spaces. Yours don't appear to have that problem though having mixed data types creates its own issues.

This all gets more difficult when you want to order by other columns like name:

select * from T t1
where coalesce(create_time, 0) = (
    select min(coalesce(create_time, 0)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
) and (
    select count(*) from (
        select * from T t1
        where coalesce(create_time, 0) = (
            select min(coalesce(create_time, 0)) from T t2
            where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
        )
    ) t3
    where t3.name < t1.name or t3.name = t1.name
        and coalesce(t3.group_id, t3.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5
order by t1.name;

That does handle ties but also makes the simplifying assumption that name can't be null which would add yet another small twist. At least you can see that it's possible without CTEs and window functions but expect these to also be a lot less efficient to run.

https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=9697fd274e73f4fa7c1a3a48d2c78691

Thorsten Kettner · Accepted Answer · 2022-01-02T01:33:09.310

You can combine two queries with UNION ALL. E.g.:

select id, name, create_time, group_id
from mytable
where group_id is not null
and not exists
(
  select null
  from mytable older
  where older.group_id = mytable.group_id
  and older.create_time < mytable.create_time  
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id;

This is standard SQL and very basic at that. It should work in about every RDBMS.

As to pagination: This is usually costly, as you run the same query again and again in order to always pick the "next" part of the result, instead of running the query only once. The best approach is usually to use the primary key to get to the next part so an index on the key can be used. In above query we'd ideally add where id > :last_biggest_id to the queries and limit the result, which would be fetch next <n> rows only in standard SQL. Everytime we run the query, we use the last read ID as :last_biggest_id, so we read on from there.

Variables, however, are dealt with differently in the various DBMS; most commonly they are preceded by either a colon, a dollar sign or an at sign. And the standard fetch clause, too, is supported by only some DBMS, while others have a LIMIT or TOP clause instead.

If these little differences make it impossible to apply them, then you must find a workaround. For the variable this can be a one-row-table holding the last read maximum ID. For the fetch clause this can mean you simply fetch as many rows as you need and stop there. Of course this isn't ideal, as the DBMS doesn't know then that you only need the next n rows and cannot optimize the execution plan accordingly.

And then there is the option not to do the pagination in the DBMS, but read the complete result into your app and handle pagination there (which then becomes a mere display thing and allocates a lot of memory of course).

Is it ok if I run this query once with `limit` only (without `offset`), process it, and then run again in order to to get completely new results? — amseager, Jan 03 '22 at 20:33
As mentioned: the idea is that you know until which ID you've already read, so with `where id > :last_biggest_id` you get all unread data and with `LIMIT n` you get the next n rows only. The query stays the same, only the value of the ID bind variable changes. — Thorsten Kettner, Jan 03 '22 at 20:45

score 0 · Answer 3 · answered Jan 01 '22 at 17:16

0

I would guess

SELECT id, name, MAX(create_time), group_id
FROM tb GROUP BY group_id 
UNION ALL
SELECT id, name, create_time, group_id
FROM tb WHERE group_id IS NULL
ORDER BY name

I should point out that 'name' is a reserved word.

answered Jan 01 '22 at 17:16

Jiulin Teng

299
3
8

SQL group by selecting top rows with possible nulls

3 Answers3