Postgres query that shows the results from 2 columns with the highest counts per month

Question

I have a database with about 2500 results from 2022. I'm trying to build queries that will show me the top industry and country results from each month throughout the entire year. The preferred output would include the top X (number) of each category. I'm fine with breaking this out into two queries, one for each category, if needed.

The table looks like this:

post_id int
post_date date
post_victim text
threat_group text
victim_industry text
victim_country text

The idea here is that I can use these queries to put together data to brief teammates or generate graphs for visualizing the resulting data.

Here's an example of what I'd like to output via a SQL query (shortened to just Jan-Apr, but the final result should cover the entire year):

Industry	Jan	Feb	Mar	Apr
Healthcare	95	79	58	12
Manufacturing	45	90	72	65
Finance	31	48	21	73
Education	63	70	46	92
Technology	86	34	99	25
Construction	53	73	16	29
Retail	70	90	95	73
Government	95	73	21	58
Agriculture	34	53	61	99
Transportation	60	89	22	38
Hospitality	70	38	96	79
Legal	89	67	92	43

I tried using crosstab and specifying the following as ct (...:

SELECT *
FROM crosstab(
  $$
  SELECT to_char(post_date, 'YYYY-MM') as month, victim_industry, count(*)
  FROM ransomware_posts
  WHERE post_date >= '2022-01-01' AND post_date < '2023-01-01'
  GROUP BY month, victim_industry
  ORDER BY month, count DESC
  $$
) as ct (month text, industry_1 text, count_1 int, industry_2 text, count_2 int)

I think the issue may have something to do with the fact that there are more than 2 results per month, so the output is receiving more results than it's expecting, but I don't know for sure as I've never built a functioning crosstab query.

Here's the error I receive:

ERROR:  return and sql tuple descriptions are incompatible
SQL state: 42601

Edouard · Accepted Answer · 2023-01-04T20:33:52.830

1

The error message comes from a gap between the structure resulting ffrom the internal query and the structure returned by the crosstab function.

This query should work :

SELECT *
FROM crosstab($$
  SELECT victim_industry, to_char(post_date, 'YYYY-MM') :: text as month, count(*) :: int
  FROM ransomware_posts
  WHERE post_date >= '2021-01-01' AND post_date < '2022-01-01'
  GROUP BY victim_industry, month
  ORDER BY victim_industry, month $$
) as ct (industry text, Jan int, Feb int, Mar int, Apr int, May int, Jun int, Jul int, Aug int, Sep int, Oct int, Nov int, Dec int)

UPDTATE

crosstab doesn't manage well the gaps if any.

There is another solution which relies on a composite type and the standard function jsonb_populate_record :

First creating a dedicated composite type :

CREATE TYPE months AS (jan int, feb int, mar int, apr int, jun int, jul int, aug int, sep int, oct int, nov int, dec int)

Then the following query provides the expected result :

SELECT r.victim_industry, (jsonb_populate_record(null :: months, jsonb_object_agg(lower(r.date), r.count))).*
  FROM 
     ( SELECT victim_industry, to_char(post_date, 'Mon') AS date, count(*) AS count
         FROM ransomware_posts
        WHERE post_date >= '2021-01-01' AND post_date < '2022-01-01'
        GROUP BY victim_industry, to_char(post_date, 'Mon')
    ) AS r
GROUP BY r.victim_industry

see dbfiddle

edited Jan 04 '23 at 20:33

answered Jan 04 '23 at 17:43

Edouard

6,577
1
9
20

This doesn't work if there's a null result. It starts each row at the earliest month with entries. For example, in the real data set there is a month where there are no results for Automotive. The results for the Automotive industry all shift so that `null` appears in December, even though there are 8 entries for December in the data. – NeodymiumPhish Jan 04 '23 at 17:52
yes this is the standard behiavor of `crosstab` : "The crosstab function produces one output row for each consecutive group of input rows with the same row_name value. It fills the output value columns, left to right, with the value fields from these rows. If there are fewer rows in a group than there are output value columns, the extra output columns are filled with nulls; if there are more rows, the extra input rows are skipped." see the [manual](https://www.postgresql.org/docs/current/tablefunc.html#id-1.11.7.52.5) – Edouard Jan 04 '23 at 18:53
I can provide you an other solution without the crosstab function, I will update my answer – Edouard Jan 04 '23 at 18:54
I see the edits made, this seems to make sense. One question I would have (thanks so much, by the way, this is awesome) would be whether this could be generalized enough so that I could re-run the query referencing another column from the table, such as victim_country, instead of victim_industry. I assume I would need to build a different type? Or could the type be created with a generic term in place of victim_industry in the example code you provided? – NeodymiumPhish Jan 04 '23 at 20:13
In fact it is possible to put `victim_industry` out of the composite type. It is even more simple, and makes the composite type reusable for `victim_country`. See my answer updated. – Edouard Jan 04 '23 at 20:35
This is perfect! Thank you so much! Marked as `answered`!! – NeodymiumPhish Jan 04 '23 at 20:40

Postgres query that shows the results from 2 columns with the highest counts per month

1 Answers1