Get n grouped categories and sum others into one

Question

I have a table with the following structure:

Contents (
  id
  name
  desc
  tdate
  categoryid
  ...
)

I need to do some statistics with the data in this table. For example I want to get number of rows with the same category by grouping and id of that category. Also I want to limit them for n rows in descending order and if there are more categories available I want to mark them as "Others". So far I have come out with 2 queries to database:

Select n rows in descending order:

SELECT COALESCE(ca.NAME, 'Unknown') AS label
    ,ca.id AS catid
    ,COUNT(c.id) AS data
FROM contents c
LEFT OUTER JOIN category ca ON ca.id = c.categoryid
GROUP BY label
    ,catid
ORDER BY data DESC LIMIT 7

Select other rows as one:

SELECT 'Others' AS label
    ,COUNT(c.id) AS data
FROM contents c
LEFT OUTER JOIN category ca ON ca.id = c.categoryid
WHERE c.categoryid NOT IN ($INCONDITION)

But when I have no category groups left in db table I still get an "Others" record. Is it possible to make it in one query and make the "Others" record optional?

Can you provide sample Data(rows) and Expected output with complete DDL of table `Contents` ? — Vivek S., Apr 10 '15 at 11:37
Your table definition should be showing data types and constraints. Basically what you get with `\d contents` in psql. And some example data with expected output would go a long way to clarify. — Erwin Brandstetter, Jun 02 '15 at 14:02

score 4 · Accepted Answer · edited May 23 '17 at 12:14

The specific difficulty here: Queries with one or more aggregate functions in the SELECT list and no GROUP BY clause produce exactly one row, even if no row is found in the underlying table.

There is nothing you can do in the WHERE clause to suppress that row. You have to exclude such a row after the fact, i.e. in the HAVING clause, or in an outer query.

Per documentation:

If a query contains aggregate function calls, but no GROUP BY clause, grouping still occurs: the result is a single group row (or perhaps no rows at all, if the single row is then eliminated by HAVING). The same is true if it contains a HAVING clause, even without any aggregate function calls or GROUP BY clause.

It should be noted that adding a GROUP BY clause with only a constant expression (which is otherwise completely pointless!) works, too. See example below. But I'd rather not use that trick, even if it's short, cheap and simple, because it's hardly obvious what it does.

The following query only needs a single table scan and returns the top 7 categories ordered by count. If (and only if) there are more categories, the rest is summarized into 'Others':

WITH cte AS (
   SELECT categoryid, count(*) AS data
        , row_number() OVER (ORDER BY count(*) DESC, categoryid) AS rn
   FROM   contents
   GROUP  BY 1
   )
(  -- parentheses required again
SELECT categoryid, COALESCE(ca.name, 'Unknown') AS label, data
FROM   cte
LEFT   JOIN category ca ON ca.id = cte.categoryid
WHERE  rn <= 7
ORDER  BY rn
)
UNION ALL
SELECT NULL, 'Others', sum(data)
FROM   cte
WHERE  rn > 7         -- only take the rest
HAVING count(*) > 0;  -- only if there actually is a rest
-- or: HAVING  sum(data) > 0

You need to break ties if multiple categories can have the same count across the 7th / 8th rank. In my example, categories with the smaller categoryid win such a race.
Parentheses are required to include a LIMIT or ORDER BY clause to an individual leg of a UNION query.
You only need to join to table category for the top 7 categories. And it's generally cheaper to aggregate first and join later in this scenario. So don't join in the the base query in the CTE (common table expression) named cte, only join in the first SELECT of the UNION query, that's cheaper.
Not sure why you need the COALESCE. If you have a foreign key in place from contents.categoryid to category.id and both contents.categoryid and category.name are defined NOT NULL (like they probably should be), then you don't need it.

The odd `GROUP BY true`

This would work, too:

...

UNION ALL
SELECT NULL , 'Others', sum(data)
FROM   cte
WHERE  rn > 7
GROUP BY true;

And I even get slightly faster query plans. But it's a rather odd hack ...

SQL Fiddle demonstrating all.

Related answer with more explanation for the UNION ALL / LIMIT technique:

Sum results of a few queries and then find top 5 in SQL

This answer is incredible. Works as expected and also executes 2 times faster, than the one that is suggested by spencer7593. Thanks a lot! — iamawebgeek, Jun 03 '15 at 06:40
To get the same behavior as the original query, i.e. to consolidate rows from `contents` with an unmatched `categoryid` value all together, with a single count, under a `catid` value of NULL, the query in the common table expression would need to perform the outer join to `category`, and group by the `id` value from the `category` table. (The difference would only be observed with the consolidated count for 'Unknown' is one of the highest 7 counts.) — spencer7593, Jun 03 '15 at 12:01
@spencer7593: Good point. I *do* suspect that the original query was going in the wrong direction here, though. `contents` with different `categoryid` would typically not be lumped together, even if the `name` for the category is missing (be it a missing row in `category` or `name IS NULL`). In a design with referential integrity neither should be possible (I commented on that in my answer), but if it can happen, I suggest the best course of action is to keep categories separate and signify missing data with `'Unknown'` like I implemented it. — Erwin Brandstetter, Jun 04 '15 at 00:46
@ErwinBrandstetter: I concur, with every one of the points you made. *(An actual table definition would likely make this entire discussion moot, since that would reveal a NOT NULL constraint and reveal an (enabled and enforced) FOREIGN KEY constraint. Absent the foreign key constraint, there could be a multitude of rows with a large variety of values for `categoryid`. If we aggregate those together, we're only going to get (at most) one row of 'Unknown' category (which is what the original query does.) Aggregating by individual categoryid, we could get multiple rows with name 'Unknown'. — spencer7593, Jun 04 '15 at 02:12
@spencer7593: Exactly. That's also why I used `contents.categoryid` in the result instead of `category.id` to exclude the possibility of missing entry in the referenced table. Well, that and there is also hardly any good reason to have it otherwise. — Erwin Brandstetter, Jun 04 '15 at 08:20

score 1 · Answer 2 · answered Jun 03 '15 at 03:35

The quick fix, to make the 'Others' row conditional would be to add a simple HAVING clause to that query.

HAVING COUNT(c.id) > 0

(If there are no other rows in the contents table, then COUNT(c.id) is going to be zero.)

That only answers half the question, how to make the return of that row conditional.

The second half of the question is a little more involved.

To get the whole resultset in one query, you could do something like this

(this is not tested yet; desk checked only.. I'm not sure if postgresql accepts a LIMIT clause in an inline view... if it doesn't we'd need to implement a different mechanism to limit the number of rows returned.

  SELECT IFNULL(t.name,'Others') AS name
       , t.catid                 AS catid
       , COUNT(o.id)             AS data 
    FROM contents o
    LEFT 
    JOIN category oa
      ON oa.id = o.category_id
    LEFT
    JOIN ( SELECT COALESCE(ca.name,'Unknown') AS name
                , ca.id                       AS catid
                , COUNT(c.id)                 AS data
             FROM contents c
             LEFT
             JOIN category ca
               ON ca.id = c.categoryid
            GROUP 
               BY COALESCE(ca.name,'Unknown')
                , ca.id
            ORDER
               BY COUNT(c.id) DESC
                , ca.id DESC
            LIMIT 7
         ) t
      ON ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL)) 
   GROUP
      BY ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL)) 
       , t.catid
   ORDER
      BY COUNT(o.id) DESC
       , ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL)) DESC
       , t.catid DESC
   LIMIT 7

The inline view t basically gets the same result as the first query, a list of (up to) 7 id values from category table, or 6 id values from category table and a NULL.

The outer query basically does the same thing, joining content with category, but also doing a check if there's a matching row from t. Because t might be returning a NULL, we have a slightly more complicated comparison, where we want a NULL value to match a NULL value. (MySQL conveniently gives us shorthand operator for this, the null-safe comparison operator <=>, but I don't think that's available in postgresql, so we have to express differently.

     a = b OR (a IS NULL AND b IS NULL)

The next bit is getting a GROUP BY to work, we want to group by the 7 values returned by the inline view t, or, if there's not matching value from t, group the "other" rows together. We can get that to happen by using a boolean expression in the GROUP BY clause.

We're basically saying "group by 'if there was a matching row from t'" (true or false) and then group by the row from 't'. Get a count, and then order by the count descending.

This isn't tested, only desk checked.

2 mistakes I fixed myself, `name` field is not in `group by` statement and changed `ifnull` to `coalesce`. After that it worked as expected. However after I tested your and @ErwinBrandstetter answer I found that your query is 2 times slower that his. Anyway thank you very much! I really appreciate your answer. — iamawebgeek, Jun 03 '15 at 06:37
The fixes are appropriate. My "MySQL" is showing. MySQL has an IFNULL function and (by default) extends the behavior of GROUP BY to allow extra non-aggregates in the SELECT list. MySQL also doesn't support Common Table Expressions or provide analytic functions like `row_number()`, used in the query in Erwin's answer. (We could get this query to run faster in MySQL, but then it wouldn't work in postgresql.) The query in Erwin's answer is faster because it only has to do the aggregation from `contents` one time (in the cte). The query in this answer does it twice. — spencer7593, Jun 03 '15 at 11:47
I do notice one difference in behavior between the query in this answer and Erwin's in a corner case... when rows in `content` contain several values in the `categoryid` column which don't appear in the `category` table. In the query in Erwin's answer, those rowswill get aggregated by the value of the `categoryid` column. But the query in this answer will aggregate all of those rows together with a NULL value for catid, as one collective total for 'Unknown' (like the original query does). The difference comes in when that aggregated set of 'Unknown' has one of the seven highest counts. — spencer7593, Jun 03 '15 at 11:56

Gordon Linoff · Answer 3 · 2015-04-11T01:44:30.497

0

You can approach this with nested aggregation. The inner aggregation calculates the counts along with a sequential number. You want to take everything whose number is 7 or less and then combine everything else into the others category:

SELECT (case when seqnum <= 7 then label else 'others' end) as label,
       (case when seqnum <= 7 then catid end) as catid, sum(cnt)
FROM (SELECT ca.name AS label, ca.id AS catid, COUNT(c.id) AS cnt,
             row_number() over (partition by ca.name, catid order by count(c.id) desc) as seqnum
      FROM contents c LEFT OUTER JOIN
           category ca
           ON ca.id = c.categoryid
      GROUP BY label, catid
     ) t
GROUP BY (case when seqnum <= 7 then label else 'others' end),
         (case when seqnum <= 7 then catid end) 
ORDER BY cnt DESC ;

edited Apr 11 '15 at 01:44

answered Apr 10 '15 at 11:39

Gordon Linoff

1,242,037
58
646
786

Getting an error : `ERROR: column "label" does not exist LINE 4: row_number() over (partition by label, catid or...` – iamawebgeek Apr 10 '15 at 11:42
Now another error: `ERROR: column "catid" does not exist LINE 4: ... row_number() over (partition by ca.name, catid orde...` – iamawebgeek May 15 '15 at 06:58

Get n grouped categories and sum others into one

3 Answers3

The odd `GROUP BY true`

Linked

Get n grouped categories and sum others into one

3 Answers3

The odd GROUP BY true

Linked

The odd `GROUP BY true`