0

I am working with Netezza SQL

I have the following :

CREATE TABLE sample_table 
(
    name VARCHAR(50),
    age INTEGER,
    year INTEGER,
    color VARCHAR(50),
    food VARCHAR(50),
    sport VARCHAR(50),
    source VARCHAR(50)
);
    
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('aaa', 41,  2010, 'Red', 'Pizza', 'hockey', 'original');
    
INSERT INTO sample_table (name, age, year, color, food, , source)
VALUES ('aaa', 42,  2012, 'Red', 'Pizza', 'hockey',  'original');

INSERT INTO sample_table (name, age, year, color, food,  source)
VALUES ('aaa', 47, 2017, 'Red', 'Pizza', 'hockey',  'original');
    
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 20 2000, 'Blue', 'Burgers','football',  'original');
    
INSERT INTO sample_table (name, age,  year, color, food, source)
VALUES ('bbb', 26,  2006, 'Blue', 'Burgers', 'football',  'original');

INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 30, 2010, 'Blue', 'Burgers', 'football',  'original');

Question: I am interested in filling missing information (e.g. age, sport, food, color, source) only between the min and the max year for each person.

When a row is missing for a person:

  • sport, food, color - these always stay the same for the same person
  • age increases by +1 relative to the last row
  • source = "missing"

First what I did was create a table of years that contains a range between the minimum and maximum years for all years within the table:

INSERT INTO years_table (year) VALUES (2010);
INSERT INTO years_table (year) VALUES (2011);
INSERT INTO years_table (year) VALUES (2012);
INSERT INTO years_table (year) VALUES (2013);
INSERT INTO years_table (year) VALUES (2014);
INSERT INTO years_table (year) VALUES (2015);
INSERT INTO years_table (year) VALUES (2016);
INSERT INTO years_table (year) VALUES (2017);
INSERT INTO years_table (year) VALUES (2018);
INSERT INTO years_table (year) VALUES (2019);
INSERT INTO years_table (year) VALUES (2020);

Then, I tried to modify the answer provided here (SQL: Learning About Gap-And-Island Problems) :

SELECT name,
       age - t.year + years_table.year AS age,
       years_table.year,
       color,
       food,
       sport,
       CASE WHEN T.year = (SELECT MAX(year) FROM sample_table) THEN 'NEW' ELSE 'ORIGINAL' END AS source
FROM (
SELECT sample_table.*,
       COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
) T
JOIN years_table ON years_table.year BETWEEN T.year AND t.lastyeartogenerate
ORDER BY name, year;

Problem: But I am not sure if I am using the CASE WHEN statement properly. The query runs, but I can see that some newly added rows have a value of SOURCE = ORIGINAL when they should have SOURCE = NEW.

Can someone please show me how to fix this?

Thanks!

stats_noob
  • 5,401
  • 4
  • 27
  • 83

1 Answers1

1

There are issues in your code (source is not declared on your table CREATE command, the INSERT will not work because of it and because you are missing the sport column in them). From your attempt at modifying my previous query, I will assume the source column does not exist.

The important bits in the query are COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) and years_table.year BETWEEN t.year AND t.lastyeartogenerate with t.year being the original sample_table.year. That is basically where the term missing information is defined in the code.
With that it mind, it is easy to deduce:

  • SOURCE = ORIGINAL when years_table.year = t.year
    Reminder: as per the above, this simply means years_table.year = sample_table.year.
  • Oppositely, SOURCE = NEW when years_table.year > t.year.
    Reminder: Thanks to the LEAD(...) - 1 and the JOIN, all the values of year_table.year between t.year + 1 (= sample_table.year + 1) and t.lastyeartogenerate (both bound included) are absent from sample_table.

Therefore, your CASE WHEN should be:

CASE WHEN years_table.year = t.year THEN 'ORIGINAL' ELSE 'NEW' END

For the alternative and IMO better version of the query I had provided (the one that uses _v_vector_idx):

CASE WHEN idx = 0 THEN 'ORIGINAL' ELSE 'NEW' END

Alternative solution: you can scrap the CASE WHEN entirely and reuse the query in a UNION ALL. Note that the JOIN must be changed in the code query to exclude the original records (they come from before the union):

SELECT *, 'ORIGINAL' AS Source
FROM sample_table
UNION ALL
SELECT name,
       age - t.year + years_table.year,
       years_table.year,
       color,
       food,
       sport,
       'NEW'
FROM (  
SELECT sample_table.*,
       COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
) T
JOIN years_table ON years_table.year BETWEEN T.year + 1 AND t.lastyeartogenerate
ORDER BY name, year

For the other query, it means you have to change the JOIN to

ON idx BETWEEN 1 AND indicesToGenerate
Atmo
  • 2,281
  • 1
  • 2
  • 21
  • 1) Suppose I correctly add the "source" column in the original table. I can then use the following query: – stats_noob Mar 16 '23 at 00:37
  • SELECT name, age - t.year + years_table.year AS age, years_table.year, color, food, sport, CASE WHEN years_table.year = t.year THEN 'ORIGINAL' ELSE 'NEW' END FROM ( SELECT sample_table.*, COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate FROM sample_table ) T JOIN years_table ON years_table.year BETWEEN T.year AND t.lastyeartogenerate ORDER BY name, year; – stats_noob Mar 16 '23 at 00:37
  • Have I understood this correctly? – stats_noob Mar 16 '23 at 00:38
  • 2) The other query you provided works from the original table (i.e. no "source" column) : – stats_noob Mar 16 '23 at 00:38
  • SELECT *, 'ORIGINAL' AS Source FROM sample_table UNION ALL SELECT name, age - t.year + years_table.year, years_table.year, color, food, sport, 'NEW' FROM ( SELECT sample_table.*, COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate FROM sample_table ) T JOIN years_table ON years_table.year BETWEEN T.year + 1 AND t.lastyeartogenerate ORDER BY name, year – stats_noob Mar 16 '23 at 00:39
  • Have I understood this correctly? And why do you think this second alternative is better? Thank you so much! – stats_noob Mar 16 '23 at 00:39
  • 1
    Regarding [add the `source` column in the original table.], you seem not to use the column anyway. Imagine I asked you to bet on the result of a coin toss with a coin with heads on both sides: do you actually need me to toss the coin to know the result? Do you even need the coin at all? Same here: a column containing only 1 value for all the records needs not exist. I would even say creating it makes the query harder to write and less clear. And think what the rationale would be to have a column for some records and 1 hardcoded value for others. – Atmo Mar 16 '23 at 14:08
  • 1
    Other question: _v_vector_idx is a preexisting table in Netezza so it allows you to do everything with less work. Moreover, saving years rather than indices would force you to constantly add more records (as time passes). With integers, you can safely assume you will never need more than e.g. 1000 of them, or in other words, you can safely assume nobody will live for more than 1000 years, ever. Just look at the fact `years_table` currently does not cover the period for `bbb`. If it was not on purpose (= if you failed to do correct initialisation / maintenance), that proves my point. – Atmo Mar 16 '23 at 14:11
  • @ Atmo: thank you so much for your replies! Thank you or teaching me about the _v_vector_idx table... I am trying to apply it in this question here: https://stackoverflow.com/questions/75758860/sql-adding-an-id-column-to-a-table – stats_noob Mar 16 '23 at 16:20