1

I am working with Netezza SQL. I have the following table:

CREATE TABLE MY_TABLE (
    id VARCHAR(50),
    year VARCHAR(50)
);

INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2010');
INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2012'); 

INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2012');
INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2013');


INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2012');
INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2013');

  INSERT INTO MY_TABLE (id, year)
    VALUES ('127', '2011');
    INSERT INTO MY_TABLE (id, year)
    VALUES ('127', '2012');
    INSERT INTO MY_TABLE (id, year)
    VALUES ('127', '2015');

  INSERT INTO MY_TABLE (id, year)
    VALUES ('126', '2019');

My Question: I am trying to find out the number of times each combination of years appears. The final result would look like this:

       combination freq
1   2011,2012,2015    1
2   2011,2012,2013    2
3 2010, 2011, 2012    1
4             2019    1

Here is my attempt at doing this in Netezza:

WITH CTE AS (
    SELECT id, year, ROW_NUMBER() OVER (PARTITION BY id ORDER BY year) AS rn
    FROM MY_TABLE
),
CTE2 AS (
    SELECT id, MAX(rn) AS max_rn
    FROM CTE
    GROUP BY id
),
CTE3 AS (
    SELECT CTE2.id, CTE.year, CTE.rn, CTE2.max_rn
    FROM CTE2
    JOIN CTE ON CTE2.id = CTE.id
),
CTE4 AS (
    SELECT id,
        MAX(CASE WHEN rn = 1 THEN year END) ||
        MAX(CASE WHEN rn = 2 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 3 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 4 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 5 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 6 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 7 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 8 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 9 THEN ',' || year END) ||
        MAX(CASE WHEN rn = 10 THEN ',' || year END)
        AS combination
    FROM CTE3
    GROUP BY id
)
SELECT combination, COUNT(*) AS freq
FROM CTE4
GROUP BY combination
ORDER BY freq DESC;

But the calculations are completely wrong:

                                       combination freq
1                                              <NA>   94
2 2010,2012,2013,2014,2015,2016,2017,2018,2019,2020    1
3 2010,2011,2012,2013,2014,2015,2016,2017,2019,2020    1
4 2010,2011,2012,2013,2014,2016,2017,2018,2019,2020    1
5 2010,2011,2012,2014,2015,2016,2017,2018,2019,2020    1
6 2010,2011,2013,2014,2015,2016,2017,2018,2019,2020    1
7 2010,2011,2012,2013,2014,2015,2016,2017,2018,2019    1

Can someone please show me how to fix this?

Thanks!

  • Note 1: I am using rn = 10 to account for the min/max range of years (which in my case is 10). I only have years 2010, 2011, ... 2020 in my dataset. Based on this information - is it possible to write a manual solution for individual years using UNION ALL style statements?

  • Note 2: Unfortunately, there is no GROUP_CONCAT function in Netezza - otherwise the following code would have worked:

Here is a sample:

  SELECT GROUP_CONCAT(year, ',' ORDER BY year) AS combination, COUNT(*) AS freq
    FROM my_data
    GROUP BY id
    ORDER BY freq DESC;
  • Note 3: Here is R code to accomplish a similar task:

Source: Counting Number of Unique Column Values Per Group

agg <- aggregate(year ~ id, my_table, paste, collapse = ", ")
final = as.data.frame(table(agg$year))
stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • Did you intend to have two sets for ID 125? – June7 Jul 06 '23 at 22:17
  • @ June7: thank you for your reply! my apologies - just corrected that right now! – stats_noob Jul 06 '23 at 22:19
  • From what I can find, need a UDF to concatenate the data for each ID then perform aggregate Count of that field https://stackoverflow.com/questions/27543129/listagg-or-wm-concat-in-netezzat. I use Access and VBA. What coding language do you use? – June7 Jul 06 '23 at 22:26
  • @June7, netezza sql she\he says. – Cetin Basoz Jul 06 '23 at 22:32
  • So Netezza is more than just a db repository and has its own programming language - like Access and VBA? Which with I certainly had no problem getting the desired output. – June7 Jul 06 '23 at 22:37
  • Does this answer your question? [Concatenating Column Values into a Comma-Separated List in Netezza](https://stackoverflow.com/questions/29046975/concatenating-column-values-into-a-comma-separated-list-in-netezza) – June7 Jul 06 '23 at 22:41
  • @June7, access (which I don't classify as a database personally) is on a different league. Other SQL databases have ANSI SQL as their language basically (SQL in general and T-SQL for MS SQL server, PL/pgSQL for postgreSQL etc) – Cetin Basoz Jul 06 '23 at 22:44
  • @June7, in access getting it would be a problem really. As far as I know access doesn't have a function like string_agg (postgreSQL, MS SQL server), group_concat (MySQL). However string_agg() was added to MS SQL server relatively recently compared to others, they may have added to access as well. For that, in access you have to define and use a UDF - it looks easy to do but not the same thing really. – Cetin Basoz Jul 06 '23 at 22:53
  • Correct, Access SQL does not support - requires a VBA UDF. As I said, I have done it with this sample data. – June7 Jul 06 '23 at 23:05
  • @June7 Your link leads to a dead link. Apparently it is to the same GROUP_CONCAT function I was just reading about that was written in C++. (dead links in that post, too) If the OP can get the code (I'm not finding it quickly), they would have to be able to compile and install it on the Netezza server. – dougp Jul 06 '23 at 23:18
  • Do you know the maximum number of years that can be associated with an ID? If so, https://dwgeek.com/netezza-recursive-query-alternative-examples.html/ might work for you. – dougp Jul 06 '23 at 23:19
  • @ dougp: The maximum number of years is 10 . Do you think I could manually write a statement for each year (e.g. 2005, 2006...2015) and then use UNION ALL? – stats_noob Jul 06 '23 at 23:25
  • Ooops, I did not actually read that second source. Sorry. – June7 Jul 07 '23 at 00:02

2 Answers2

3

Since Netezza lacks LISTAGG, STRING_AGG, and GROUP_CONCAT, this is a bit more challenging.

Reading that Netezza software was based on PostgreSQL 7.2, I created this against 9.6 -- the oldest available on dbfiddle.uk.

CREATE TABLE MY_TABLE (
    id VARCHAR(50),
    year VARCHAR(50)
);

INSERT INTO MY_TABLE (id, year) VALUES ('123', '2010');
INSERT INTO MY_TABLE (id, year) VALUES ('123', '2011');
INSERT INTO MY_TABLE (id, year) VALUES ('123', '2012'); 

INSERT INTO MY_TABLE (id, year) VALUES ('124', '2011');
INSERT INTO MY_TABLE (id, year) VALUES ('124', '2012');
INSERT INTO MY_TABLE (id, year) VALUES ('124', '2013');

INSERT INTO MY_TABLE (id, year) VALUES ('125', '2011');
INSERT INTO MY_TABLE (id, year) VALUES ('125', '2012');
INSERT INTO MY_TABLE (id, year) VALUES ('125', '2013');

INSERT INTO MY_TABLE (id, year) VALUES ('127', '2011');
INSERT INTO MY_TABLE (id, year) VALUES ('127', '2012');
INSERT INTO MY_TABLE (id, year) VALUES ('127', '2015');

INSERT INTO MY_TABLE (id, year) VALUES ('126', '2019');


WITH CTE AS (
  SELECT id, year, ROW_NUMBER() OVER (PARTITION BY id ORDER BY year) AS rn
  FROM MY_TABLE
),
CTE3 AS (
  select id
  , year
  , rn
  , max(rn) over (partition by id) as MaxRN
  from cte
),
CTE4 AS (  
  SELECT id
  , coalesce(max(CASE WHEN rn = 1 THEN year END), '') ||
    coalesce(max(CASE WHEN rn = 2  THEN ',' || year END), '') ||
    coalesce(max(CASE WHEN rn = 3  THEN ',' || year END), '') ||
    coalesce(max(CASE WHEN rn = 4  THEN ',' || year END), '') ||
    coalesce(max(CASE WHEN rn = 5  THEN ',' || year END), '') ||
    coalesce(max(CASE WHEN rn = 6  THEN ',' || year END), '') ||
    coalesce(max(CASE WHEN rn = 7  THEN ',' || year END), '') ||
    coalesce(max(CASE WHEN rn = 8  THEN ',' || year END), '') ||
    coalesce(max(CASE WHEN rn = 9  THEN ',' || year END), '') ||
    coalesce(max(CASE WHEN rn = 10 THEN ',' || year END), '')
    AS combination
  FROM CTE3
  GROUP BY id
)

SELECT combination
, count(id) as freq
FROM CTE4
GROUP BY combination

COALESCE() is needed because SELECT 'a' || NULL returns NULL. Without this the result would be

combination freq
null 5
dougp
  • 2,810
  • 1
  • 8
  • 31
1

(Would be a mess in comments) I don't use Netezza, nor have any access to a Netezza fiddle\playground to test this. Just from documentation it looks like this would do:

with yeargroups (years) as 
(
SELECT LISTAGG(cast(year as varchar(400)), ', ') WITHIN GROUP(ORDER BY year)
      FROM My_Table 
      GROUP BY id
)
select years as Combination, count(*) as Frequency
from yeargrooups
group by years;

PS: varchar(400) is for accommodating to 100 years (100*4) as per Netezza doc.

June7
  • 19,874
  • 8
  • 24
  • 34
Cetin Basoz
  • 22,495
  • 3
  • 31
  • 39
  • @ Cetin Basoz: Thank you so much for your answer! Unfortunately, it looks like this does not work :( – stats_noob Jul 06 '23 at 23:26
  • @Cetin Please provide a link to the documentation for Netezza's `LISTAGG` function. – dougp Jul 06 '23 at 23:45
  • @dougp, my bad. I didn't really pay attention the result when I googled "netezza listagg". It was IBM db2 not IBM netezza. Looks like netezza is even rare than db2. Why would they have it in db2 but not in another product :) In netezza documentation, try filtering, ie: just type yyy, li, or xxx and kaboom. – Cetin Basoz Jul 06 '23 at 23:54
  • @dougp,wikipedia says "Netezza software was based on PostgreSQL 7.2,[11] but did not maintain compatibility". Too bad for them. PostgreSQL got string_agg at 9.0. – Cetin Basoz Jul 07 '23 at 00:02