I am working with Netezza SQL
I have the following :
CREATE TABLE sample_table
(
name VARCHAR(50),
age INTEGER,
year INTEGER,
color VARCHAR(50),
food VARCHAR(50),
sport VARCHAR(50),
source VARCHAR(50)
);
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('aaa', 41, 2010, 'Red', 'Pizza', 'hockey', 'original');
INSERT INTO sample_table (name, age, year, color, food, , source)
VALUES ('aaa', 42, 2012, 'Red', 'Pizza', 'hockey', 'original');
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('aaa', 47, 2017, 'Red', 'Pizza', 'hockey', 'original');
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 20 2000, 'Blue', 'Burgers','football', 'original');
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 26, 2006, 'Blue', 'Burgers', 'football', 'original');
INSERT INTO sample_table (name, age, year, color, food, source)
VALUES ('bbb', 30, 2010, 'Blue', 'Burgers', 'football', 'original');
Question: I am interested in filling missing information (e.g. age, sport, food, color, source) only between the min and the max year for each person.
When a row is missing for a person:
- sport, food, color - these always stay the same for the same person
- age increases by +1 relative to the last row
- source = "missing"
First what I did was create a table of years that contains a range between the minimum and maximum years for all years within the table:
INSERT INTO years_table (year) VALUES (2010);
INSERT INTO years_table (year) VALUES (2011);
INSERT INTO years_table (year) VALUES (2012);
INSERT INTO years_table (year) VALUES (2013);
INSERT INTO years_table (year) VALUES (2014);
INSERT INTO years_table (year) VALUES (2015);
INSERT INTO years_table (year) VALUES (2016);
INSERT INTO years_table (year) VALUES (2017);
INSERT INTO years_table (year) VALUES (2018);
INSERT INTO years_table (year) VALUES (2019);
INSERT INTO years_table (year) VALUES (2020);
Then, I tried to modify the answer provided here (SQL: Learning About Gap-And-Island Problems) :
SELECT name,
age - t.year + years_table.year AS age,
years_table.year,
color,
food,
sport,
CASE WHEN T.year = (SELECT MAX(year) FROM sample_table) THEN 'NEW' ELSE 'ORIGINAL' END AS source
FROM (
SELECT sample_table.*,
COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
) T
JOIN years_table ON years_table.year BETWEEN T.year AND t.lastyeartogenerate
ORDER BY name, year;
Problem: But I am not sure if I am using the CASE WHEN statement properly. The query runs, but I can see that some newly added rows have a value of SOURCE = ORIGINAL when they should have SOURCE = NEW.
Can someone please show me how to fix this?
Thanks!