SQL: Learning About Gap-And-Island Problems

Question

I am working with Netezza SQL.

In a previous question (Replacing CTE's with Individual Queries) I learned about the basics of "Gap and Island Problems" in which the goal is to "fill" missing records for each name.

Suppose there is a table with the names of different people over different years (some of the years are missing). Lets assume that each person has favorite color, favorite food and favorite sport - and this information DOES NOT change over the years. However, the age of each person DOES change each year.

   CREATE TABLE sample_table 
    (
        name VARCHAR(50),
        age INTEGER,
        year INTEGER,
        color VARCHAR(50),
        food VARCHAR(50),
        sport VARCHAR(50)
    );
    
    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('aaa', 41,  2010, 'Red', 'Pizza', 'hockey');
    
    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('aaa', 42,  2012, 'Red', 'Pizza', 'hockey');

    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('aaa', 47, 2017, 'Red', 'Pizza', 'hockey');
    
    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('bbb', 20 2000, 'Blue', 'Burgers','football');
    
    INSERT INTO sample_table (name, age,  year, color, food)
    VALUES ('bbb', 26,  2006, 'Blue', 'Burgers', 'football');


    INSERT INTO sample_table (name, age, year, color, food)
    VALUES ('bbb', 30, 2010, 'Blue', 'Burgers', 'football');



+------+-----+------+-------+---------+----------+
| name | age | year | color |  food   |  sport   |
+------+-----+------+-------+---------+----------+
| aaa  |  41 | 2010 | Red   | Pizza   | hockey   |
| aaa  |  42 | 2012 | Red   | Pizza   | hockey   |
| aaa  |  47 | 2017 | Red   | Pizza   | hockey   |
| bbb  |  20 | 2000 | Blue  | Burgers | football |
| bbb  |  26 | 2006 | Blue  | Burgers | football |
| bbb  |  30 | 2010 | Blue  | Burgers | football |
+------+-----+------+-------+---------+----------+

In this problem, I am interested in filling missing information (e.g. age, sport, food, color) only between the min and the max year for each person. Specifically, I would like to learn how to do this problem without CTEs and through a "standard query".

Here is what I have attempted so far:

# https://stackoverflow.com/questions/75677585/replacing-ctes-with-individual-queries
    create table years_table (year integer);
    
    insert into years_table(year)
    values(2010);
    insert into years_table(year)
    values(2011);
    insert into years_table(year)
    values(2012);
    insert into years_table(year)
    values(2013);
    insert into years_table(year)
    values(2014);
    insert into years_table(year)
    values(2015);
    insert into years_table(year)
    values(2016);
    insert into years_table(year)
    values(2017);
    insert into years_table(year)
    values(2018);
    insert into years_table(year)
    values(2019);
    insert into years_table(year)
    values(2020);
    
    
    select name, year, max(color) over(partition by name, grp order by year) color, max(sport) over(partition by name, grp order by year) sport, max(food) over(partition by name, grp order by year) food
    from ( 
        select n.name, y.year, t.color, t.food, t.sport
            sum(case when t.name is null then 0 else 1 end) over(partition by n.name order by y.year) grp
        from (
            select name, min(year) min_year, max(year) max_year
            from sample_table
            group by name
        ) n
        inner join years_table y on y.year between n.min_year and n.max_year
        left join sample_table t on t.name = n.name and t.year = y.year
    ) t

But I am not sure how to adapt this SQL code to make the "age" information for each person change over the years.

Can someone please show me how to do this? Preferably, I would like to learn how to do this without Recursive CTE's as they are not supported in Netezza.

Thanks!

Note: The final result should look something like this:

+------+-----+------+-------+---------+----------+
| name | age | year | color |  food   |  sport   |
+------+-----+------+-------+---------+----------+
| aaa  |  41 | 2010 | Red   | Pizza   | hockey   |
| aaa  |  42 | 2011 | Red   | Pizza   | hockey   |
| aaa  |  42 | 2012 | Red   | Pizza   | hockey   |
| aaa  |  43 | 2013 | Red   | Pizza   | hockey   |
| aaa  |  44 | 2014 | Red   | Pizza   | hockey   |
| aaa  |  45 | 2015 | Red   | Pizza   | hockey   |
| aaa  |  46 | 2016 | Red   | Pizza   | hockey   |
| aaa  |  47 | 2017 | Red   | Pizza   | hockey   |
| bbb  |  20 | 2000 | Blue  | Burgers | football |
| bbb  |  21 | 2001 | Blue  | Burgers | football |
| bbb  |  22 | 2002 | Blue  | Burgers | football |
| bbb  |  23 | 2003 | Blue  | Burgers | football |
| bbb  |  24 | 2004 | Blue  | Burgers | football |
| bbb  |  25 | 2005 | Blue  | Burgers | football |
| bbb  |  26 | 2006 | Blue  | Burgers | football |
| bbb  |  27 | 2007 | Blue  | Burgers | football |
| bbb  |  28 | 2008 | Blue  | Burgers | football |
| bbb  |  29 | 2009 | Blue  | Burgers | football |
| bbb  |  30 | 2010 | Blue  | Burgers | football |
+------+-----+------+-------+---------+----------+

The way I'd do it, is to calculate a birth year from each source row, then propagate that in to the missing rows and recalculate the age with Year-BirthYear. That way you don't need to think about inconsistencies in the source data (the results reflect the source, consistent or not). I also don't think you need a gaps-and-islands approach, you can just use CROSS APPLY to find the you to propagate from... (I'll write an answer, but it will be slow, on my phone...) — MatBailie, Mar 09 '23 at 23:44
Your desired results now have age 42 in both 2011 and 2012, due to the inconsistencies in the source data, my answer replicates that behaviour rather than complaining about it. — MatBailie, Mar 10 '23 at 00:06
If you're working with Netezza you should tag it as such. Different SQL implementations hve different syntaxes, features and capabilities. — Tangentially Perpendicular, Mar 10 '23 at 00:18

Atmo · Accepted Answer · 2023-03-10T15:02:38.800

I do not know about Netezza so much but if it supports window functions, you can use a LEAD to get the year of the next record in your table.

SELECT sample_table.*,
       COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table

Personally, I would stop the work here and let whatever application is in charge of generating the final table take over from there with a nested loop. In pseudo-code:

for each record in recordset, do {
    for y from year to lastyeartogenerate do {
        [...]
    }
}

That limits the query complexity and the amount of redundant data to be sent over a network.

If you still wish to get the expected table you mention right out of the database, simply join with years_table , although it needs to start on year 2000.

SELECT name, 
       age - t.year + years_table.year AS age,
       years_table.year,
       color,
       food,
       sport
FROM (
SELECT sample_table.*,
       COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
) T
JOIN years_table ON years_table.year BETWEEN T.year AND t.lastyeartogenerate
ORDER BY name, year

Better yet, replace your years_table with something in the line of what is presented here (I must admit it is slightly above what I know of that DBMS and I have no access to one to test):

SELECT name, 
       age + idx AS age,
       year + idx AS year,
       color,
       food,
       sport
FROM (
SELECT sample_table.*,
       COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) - year AS indicestogenerate
FROM sample_table
) T
JOIN _v_vector_idx ON idx <= indicesToGenerate
ORDER BY name, year + idx

I expect this to work but then again, have no access to a DB to test it.

Even cleaner; use a numbers table rather than a years table (with LEAD still) ; https://dbfiddle.uk/NL-dLkEj — MatBailie, Mar 10 '23 at 00:35
Thanks @MatBailie, I happened to be on my way to edit the answer with exactly that, great minds think alike. It seems Netezza has something to do exactly what you suggest without having to create a table; that is even cleaner. — Atmo, Mar 10 '23 at 00:40
No self-joins and just one window function call +1. Note: maybe Netezza supports replacing this: `COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year)` with `LEAD(year, 1, year + 1) OVER (partition by name ORDER BY year) -1`? — GMB, Mar 10 '23 at 18:06
@ Atmo: thank you so much for your answer! I posted a related question here - can you please take a look at it if you have time? https://stackoverflow.com/questions/75748541/sql-using-case-when-for-missing-rows ... thank you so much! — stats_noob, Mar 15 '23 at 18:19
@ Atmo: I am trying to apply similar logic to this question here - can you please take a look at it if you have time? https://stackoverflow.com/questions/75824265/sql-using-window-aggregate-functions thank you so much! — stats_noob, Mar 24 '23 at 12:50

MatBailie · Answer 2 · 2023-03-10T00:09:49.160

The way I'd do it, is to calculate a birth year from each source row, then propagate that in to the missing rows and recalculate the age with Age = Year - BirthYear and BirthYear = source.year - source.age.

That way you don't need to think about inconsistencies or variation in the source data (the results reflect the source, consistent or not).

I also don't think you need a gaps-and-islands approach, you can just use CROSS APPLY to find the most recent row on or before the current year, and propagate the values from there...

WITH
  person
AS
(
  SELECT
    name,
    MIN(year)   AS min_year,
    MAX(year)   AS max_year
  FROM
    sample_table
  GROUP BY
    name
)
SELECT
  p.name,
  y.year,
  s.color,
  s.food,
  s.sport,
  y.year - (s.year - s.age) AS age
FROM
  person        AS p
INNER JOIN
  years_table   AS y
    ON y.year BETWEEN p.min_year AND p.max_year
CROSS APPLY
(
  SELECT *
    FROM sample_table
   WHERE name  = p.name
     AND year <= y.year
ORDER BY year DESC
   LIMIT 1
)
  AS s

If you really need to avoid CTEs, which I only just read, sorry, you just move the CTE's definition as a sub-query in the main query, it would behave identically.

Demo : https://dbfiddle.uk/G9AMFDdE

(Uses TOP 1 instead of LIMIT 1 so I can abuse SQL Server as a Netezza SQL proxy.)

tinazmu · Answer 3 · 2023-03-10T03:42:22.460

Here is a version with no CTE, no windowing function:

    
select 
      AllCombos.name
    , AllCombos.year
    , coalesce(st.age,stprev.age+(AllCombos.year-stprev.year)) age
    , coalescE(st.food,stprev.food) food
    , coalesce(st.color,stprev.color) color
    , coalesce(st.sport,stprev.sport) sport
from (
        select name, year
        from (
            select name, min(year) min_year, max(year) max_year
            from #sample_table
            group by name
                ) n
        inner join #years_table y on y.year between n.min_year and n.max_year
        ) AllCombos
                
        left join
        #sample_table st
        on st.name=AllCombos.name
        and st.year=AllCombos.year

        left join
        #sample_table stprev
        on stprev.name=AllCombos.name
        and stprev.year<AllCombos.year
        and not exists (select 1
                        from #sample_table morerecent
                        where morerecent.name=stprev.name
                        and morerecent.year<AllCombos.year
                        and morerecent.year>stprev.Year)

It finds all name-year combinations desired to be listed, then attempts (left join) to find an exact match, and if not found, it finds the most recent prior row. It also adjusts the Age by the number of years between (but this can't be accurate, because the records are by year, and a person would be in two different ages in each year).

Why would you want to avoid window functions? And how does this improve on the CROSS APPLY approach? — MatBailie, Mar 10 '23 at 08:26
Hi @matbailie, it is just an alternative. Window functions are relatively new and may not be supported by a particular product. Same with cross apply, I have nothing against either — tinazmu, Mar 10 '23 at 08:49
Also, you don't need the first LEFT JOIN : https://dbfiddle.uk/ApzDb_Sn — MatBailie, Mar 10 '23 at 08:52
The op specified netezza, which specifically does support cross apply and window functions. — MatBailie, Mar 10 '23 at 08:53

score 1 · Answer 4 · answered Mar 10 '23 at 12:31

To start with - I would call this - common - problem a gap filling task. Gaps and Islands are a different animal, in my eyes.

Having said that - also Netezza supports the LAST_VALUE(col IGNORE NULLS) OVER (PARTITION BY ... ORDER BY ...) OLAP function. Working with that, you get a much more readable - and efficient query than a lot of complex self-joins ...

-- complete with the in data, in the shape of Common Table Expressions in a WITH clause ...
WITH
sample_table("name",age,"year",colour,food,sport) AS (
          SELECT 'aaa',41,2010,'Red','Pizza','hockey'
UNION ALL SELECT 'aaa',42,2012,'Red','Pizza','hockey'
UNION ALL SELECT 'aaa',47,2017,'Red','Pizza','hockey'
UNION ALL SELECT 'bbb',20,2000,'Blue','Burgers','football'
UNION ALL SELECT 'bbb',26,2006,'Blue','Burgers','football'
UNION ALL SELECT 'bbb',30,2010,'Blue','Burgers','football'
)
,
years_table("year") AS (
            SELECT 2010 UNION ALL SELECT 2011
  UNION ALL SELECT 2012 UNION ALL SELECT 2013
  UNION ALL SELECT 2014 UNION ALL SELECT 2015
  UNION ALL SELECT 2016 UNION ALL SELECT 2017
  UNION ALL SELECT 2018 UNION ALL SELECT 2019
  UNION ALL SELECT 2020
)
SELECT
  LAST_VALUE("name"   IGNORE NULLS) OVER(order by y."year") AS "name"
,   LAST_VALUE(y."year" IGNORE NULLS) OVER(order by y."year") 
  - LAST_VALUE(s."year" IGNORE NULLS) OVER(order by y."year") 
  + LAST_VALUE(age      IGNORE NULLS) OVER(order by y."year")
  AS age
, y."year"
, LAST_VALUE(colour   IGNORE NULLS) OVER(order by y."year") AS colour
, LAST_VALUE(food     IGNORE NULLS) OVER(order by y."year") AS food  
, LAST_VALUE(sport    IGNORE NULLS) OVER(order by y."year") AS sport 
FROM      years_table  y
LEFT JOIN sample_table s
  ON y."year" >= s."year"
 AND y."year" <= s."year"
ORDER BY name, "year"
;

Result:

name	age	year	colour	food	sport
aaa	42	2,012	Red	Pizza	hockey
aaa	43	2,013	Red	Pizza	hockey
aaa	44	2,014	Red	Pizza	hockey
aaa	45	2,015	Red	Pizza	hockey
aaa	46	2,016	Red	Pizza	hockey
aaa	47	2,017	Red	Pizza	hockey
aaa	48	2,018	Red	Pizza	hockey
aaa	49	2,019	Red	Pizza	hockey
aaa	50	2,020	Red	Pizza	hockey
bbb	30	2,010	Blue	Burgers	football
bbb	30	2,010	Blue	Burgers	football
bbb	31	2,011	Blue	Burgers	football

Few database support the `ignore null` option to first/last_value, and it is a very good fit for the question. I think that this is the most efficient (and neatest) solution to the problem, @stats_noob. — GMB, Mar 10 '23 at 12:39
@GMB, would you mind explaining me why you say 7 calls to `LAST_VALUE` is the most efficient and neatest solution? Is it not more calls to a window function than needed? I do not have access to a database where `ignore null` works so there must be something I do not see. — Atmo, Mar 10 '23 at 15:15
With some time spent on reading the query, trying to understand why it was vastly different from the expected result in the question, I think I spotted some mistakes in it. But in case I am missing an important piece of logic, may I ask you explain what exactly it does to perform the gap filling task? — Atmo, Mar 10 '23 at 15:18
@Atmo: actually I misread *your* query and thought it would not produce the expected results, so I went for that one, which was my second pick overall - because it has no joins. But your query does the job with no joins either, and with less window function calls +1. — GMB, Mar 10 '23 at 18:02
A window function uses the PARTITION BY ... ORDER BY reshuffling for the same window definition once in the same query. It becomes a messy access plan if each window expression is different, as it has to re-shuffle the data for each different window expression. But if it's always the same, no. — marcothesane, Mar 11 '23 at 12:31

SQL: Learning About Gap-And-Island Problems

4 Answers4

Linked