0

I am working with Netezza SQL.

I have the following table:

CREATE TABLE sample_table 
(
    name VARCHAR(50),
    year INTEGER,
    color VARCHAR(50)
);

INSERT INTO sample_table (name, year, color)
VALUES ('aaa', 2010, 'Red');

INSERT INTO sample_table (name, year, color)
VALUES ('aaa', 2012, 'Red');

INSERT INTO sample_table (name, year, color)
VALUES ('bbb', 2014, 'Blue');

INSERT INTO sample_table (name, year, color)
VALUES ('bbb', 2016, 'Blue');

The table looks something like this:

+------+--------+-------+
| name |  year  | color |
+------+--------+-------+
| aaa  |  2010  |  Red  |
| aaa  |  2012  |  Red  |
| bbb  |  2014  | Blue  |
| bbb  |  2016  | Blue  |
+------+--------+-------+

As we can see:

  • "aaa" has a missing row between 2010 and 2012 (i.e. 2011)
  • "bbb" has a missing row between 2014 and 2016 (i.e. 2015)

My question: I want to write a SQL query that adds these missing rows for both of these names (assume that the "color" for each "name" remains the same). The final output should look something like this:

+------+--------+-------+
| name |  year  | color |
+------+--------+-------+
| aaa  |  2010  |  Red  |
| aaa  |  2011  |  Red  |
| aaa  |  2012  |  Red  |
| bbb  |  2014  | Blue  |
| bbb  |  2015  | Blue  |
| bbb  |  2016  | Blue  |
+------+--------+-------+

In a previous question, I learned how to approach a similar problem using Recursive CTE's (Adding Missing Rows in SQL using JOINS). Could someone please show me how I could try to solve this problem by creating intermediate tables? (e.g. temp_tab1, temp_tab2, ... drop temp_tab1)

I am still learning SQL and find it easier to follow smaller chunks of code

Thanks!

GMB
  • 216,147
  • 25
  • 84
  • 135
stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • 1
    If the color was Red in 2010 and Blue in 2012, what color is the missing row? What if there were several missing rows? – Stu Mar 08 '23 at 19:55
  • 2
    There is no replacement for a recursive CTE. It's a special beast all it's own. This is even further complicated by the fact that this is a gaps-and-islands problem ( think). Is it possible in your data to have `aaa | 2017 | red` as the next record? If so would you need to generate records for between 2013 and 2016 for red, or would you skip 2014-2016 becuase `blue` already has those years? – JNevill Mar 08 '23 at 19:58
  • @ Stu: thank you for your reply! Lets assume that for each name, the color always stays the same. – stats_noob Mar 08 '23 at 20:07
  • @ JNevill: thank you for your reply! For each name, i only want to fill records between the earliest and latest record. – stats_noob Mar 08 '23 at 20:08

2 Answers2

2

To avoid the recursion, you could have a table that lists all possible years, and that you would use to generate the "missing" rows of each product, say years_table(years).

We would typically generate the start and end year of each name in a subquery, then draw all years in between from the years table, and finally check if we have a match in the original data, with a left join;

select n.name, y.year, t.color
from (
    select name, min(year) min_year, max(year) max_year
    from sample_table
    group by name
) n
inner join years_table y on y.year between n.min_year and n.max_year
left join sample_table t on t.name = n.name and t.year = y.year
    

This works, but does not gives you a color on rows that were "added" by the query logic. For this, we would need quite a lot more work (that's a gaps and islands problem). Here is one way to do it:

select name, year, max(color) over(partition by name, grp order by year) color
from ( 
    select n.name, y.year, t.color, 
        sum(case when t.name is null then 0 else 1 end) over(partition by n.name order by y.year) grp
    from (
        select name, min(year) min_year, max(year) max_year
        from sample_table
        group by name
    ) n
    inner join years_table y on y.year between n.min_year and n.max_year
    left join sample_table t on t.name = n.name and t.year = y.year
) t

That's yet another reason why I would advocate in favor of a recursive query here. The idea is to identifiy the gaps in the date series, and use the recursion to generate only the missing rows. Since the recursive query starts from an anchor, it is easy to keep track of the "current" color.

with recursive
    data as (
        select t.*, lead(year) over(partition by name order by year) lead_year
        from sample_table t
    ),
    rec as (
        select name, year,     color, lead_year from data where lead_year > year + 1
        union all
        select name, year + 1, color, lead_year from rec  where lead_year > year + 1
    )
select name, year, color from data where lead_year is null or lead_year = year + 1
union all
select name, year, color from rec
order by name, year

Here:

  • data looks up the "next" available year, which allows identifying gaps
  • rec starts from the beginning of each gap, and generate new rows until the next island is reached, keeping track of the original colors
  • the outer query unions together the original islands and the gaps that we filled

Here is a demo on DB Fiddle with a few rows added to your data, and both solutions at work.

Input:

name year color
aaa 2010 Red
aaa 2012 Red
aaa 2013 Blue
aaa 2016 Blue
bbb 2014 Blue
bbb 2016 Blue

Output:

name year color
aaa 2010 Red
aaa 2011 Red
aaa 2012 Red
aaa 2013 Blue
aaa 2014 Blue
aaa 2015 Blue
aaa 2016 Blue
bbb 2014 Blue
bbb 2015 Blue
bbb 2016 Blue
GMB
  • 216,147
  • 25
  • 84
  • 135
  • @ GMB: Thank you so much for your answer! Another reason that I wanted to learn alternate ways to solve this problem is that because Netezza does not allow WITH RECURSIVE CTEs . – stats_noob Mar 09 '23 at 01:14
  • @ GMB: I am trying to adapt your code in a more simpler format for this problem over here: https://stackoverflow.com/questions/75824265/sql-using-window-aggregate-functions/75824321 - can you please take a look at it if you have time? thank you so much! – stats_noob Mar 23 '23 at 15:55
1

Your basic "problem" is that you haven#t the year 200- 2009 in your year table

Change it to fit your data, you get your wanted result, i

if you need more years, you must change the start date for the CTE

   CREATE TABLE sample_table 
    (
        name VARCHAR(50),
        age INTEGER,
        year INTEGER,
        color VARCHAR(50),
        food VARCHAR(50),
        sport VARCHAR(50)
    );
    
    INSERT INTO sample_table (name, age, year, color, food,sport)
    VALUES ('aaa', 41,  2010, 'Red', 'Pizza', 'hockey');
    
    INSERT INTO sample_table (name, age, year, color, food,sport)
    VALUES ('aaa', 42,  2012, 'Red', 'Pizza', 'hockey');

    INSERT INTO sample_table (name, age, year, color, food,sport)
    VALUES ('aaa', 47, 2017, 'Red', 'Pizza', 'hockey');
    
    INSERT INTO sample_table (name, age, year, color, food,sport)
    VALUES ('bbb', 20, 2000, 'Blue', 'Burgers','football');
    
    INSERT INTO sample_table (name, age,  year, color, food,sport)
    VALUES ('bbb', 26,  2006, 'Blue', 'Burgers', 'football');


    INSERT INTO sample_table (name, age, year, color, food,sport)
    VALUES ('bbb', 30, 2010, 'Blue', 'Burgers', 'football');

  create table years_table (year integer);
    

with recursive yearlist(year) as 
(
    select 2000 as year
    union all
    select yl.year + 1 as year
    from yearlist yl
    where yl.year + 1 <= date_part('year',CURRENT_DATE)
)
    insert into years_table(year)
select year from yearlist order by year desc;;
    
 
CREATE TABLE
INSERT 0 24
    
    select name, year
      , max(color) over(partition by name, grp order by year) color
      , max(sport) over(partition by name, grp order by year) sport
      , max(food) over(partition by name, grp order by year) food
    from ( 
        select n.name, y.year, t.color, t.food, t.sport,
            sum(case when t.name is null then 0 else 1 end) over(partition by n.name order by y.year) grp
        from (
            select name, min(year) min_year, max(year) max_year
            from sample_table
            group by name
        ) n
        inner join years_table y on y.year between n.min_year and n.max_year
        left join sample_table t on t.name = n.name and t.year = y.year
    ) t
name year color sport food
aaa 2010 Red hockey Pizza
aaa 2011 Red hockey Pizza
aaa 2012 Red hockey Pizza
aaa 2013 Red hockey Pizza
aaa 2014 Red hockey Pizza
aaa 2015 Red hockey Pizza
aaa 2016 Red hockey Pizza
aaa 2017 Red hockey Pizza
bbb 2000 Blue football Burgers
bbb 2001 Blue football Burgers
bbb 2002 Blue football Burgers
bbb 2003 Blue football Burgers
bbb 2004 Blue football Burgers
bbb 2005 Blue football Burgers
bbb 2006 Blue football Burgers
bbb 2007 Blue football Burgers
bbb 2008 Blue football Burgers
bbb 2009 Blue football Burgers
bbb 2010 Blue football Burgers
SELECT 19

fiddle

nbk
  • 45,398
  • 8
  • 30
  • 47
  • that is much esasier than cmomplicated gaps and island solutions – nbk Mar 23 '23 at 16:16
  • I am working on a related question here - can you please take a look at it if you have time? https://stackoverflow.com/questions/75824265/sql-using-window-aggregate-functions thank you so much! – stats_noob Mar 23 '23 at 16:43