1

I have this dataset in R:

name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"

my_data = data.frame(name, year, age, gender, country_of_birth, source)

As we can see, some of the people in this dataset have rows with missing years (e.g. John goes from 2011 to 2014):

   name year age gender country_of_birth   source
1  john 2010  21   male        australia ORIGINAL
2  john 2011  22   male        australia ORIGINAL
3  john 2014  25   male        australia ORIGINAL
4 sarah 2010  55 female               uk ORIGINAL
5 sarah 2015  60 female               uk ORIGINAL
6 peter 2011  61   male           mexico ORIGINAL
7 peter 2012  62   male           mexico ORIGINAL
8 peter 2013  63   male           mexico ORIGINAL
9 peter 2015  65   male           mexico ORIGINAL

I have this code that is able to add these missing rows by "interpolating" logical values of the missing rows (e.g. age increases by 1, country_of_birth stays the same, etc.), and records if this row was added later or original:

library(tidyverse)
library(dplyr)

# R Code to Convert into SQL
final = my_data %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

# A tibble: 16 x 6
# Groups:   name [3]
   name   year   age gender country_of_birth source      
   <chr> <dbl> <dbl> <chr>  <chr>            <chr>       
 1 john   2010    21 male   australia        ORIGINAL    
 2 john   2011    22 male   australia        ORIGINAL    
 3 john   2012    23 male   australia        NOT ORIGINAL

My Question: I am trying to learn how I can convert this above code into (Netezza) SQL Code.

To get an idea how to begin, I thought I could use the "dbplyr" library in R to convert my "dplyr" code into SQL:

library(dbplyr)

# attempt 1
remote_df = tbl_lazy(my_data, con = simulate_mysql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 2
remote_df = tbl_lazy(my_data, con = simulate_mssql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 3

 con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

 remote_df <- copy_to(con, my_data)

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))



# attempt 4

memdb_frame(my_data) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

However all these attempts give me the same error:

Error in `fill()`:
x `.data` does not have explicit order.
i Please use `arrange()` or `window_order()` to make determinstic.
Run `rlang::last_error()` to see where the error occurred.

Can someone please show me what I am doing wrong and what I can do to convert this R code into SQL code? I had thought that perhaps I could find out which rows are missing for which person, create these rows - and then somehow use JOINS to bring them back to the original dataset in SQL.

Thanks!

stats_noob
  • 5,401
  • 4
  • 27
  • 83

1 Answers1

2

In SQL you can create a "Cartesian product" of rows through a cross join. I believe the equivalent in R is merge() . For the years that are missing, in SQL, you would need a table or resultset, but you should be able to use a sequence in R

Incorporating a sequence of years and the merge() function:

library(tidyverse)
library(dplyr)

# Create a data frame with the sequence of years
years_df <- data.frame(year = seq(2010, 2023))

# Perform a cross join with the original data
final <- merge(my_data, years_df, all = TRUE) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

The above is untested!

SQL Code:

CREATE TABLE mytable (
    name VARCHAR(255),
    year INTEGER,
    age INTEGER,
    gender VARCHAR(255),
    country_of_birth VARCHAR(255),
    source VARCHAR(255)
);

INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2010, 21, 'male', 'australia', 'ORIGINAL');
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2011, 22, 'male', 'australia', 'ORIGINAL');
etc.

Example query with join that lists all years and joins to data when there is a match:

WITH RECURSIVE years (year) AS (
    SELECT 2010
    UNION ALL
    SELECT year + 1
    FROM years
    WHERE year < 2023
)
SELECT 
    t.name, years.year, t.age, t.gender, t.country_of_birth, t.source
FROM years
LEFT JOIN mytable AS t ON years.year = t.year
;

When I reconsidered this you don't need a cross join. Instead you need a left join or it can be expressed as left outer join (in SQL).

In R, you can perform a left join using the merge function. Here’s an example:

# Create two example data frames
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("d", "e", "f"))

# Perform a left join
left_join <- merge(df1, df2, by = "id", all.x = TRUE)

# View the result
left_join

This creates two data frames df1 and df2 and then performs a left join on the id column using the merge function. The all.x = TRUE argument specifies that all rows from df1 should be included in the result even if there is no matching row in df2. The resulting data frame contains all rows from df1 along with any matching rows from df2.

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
  • @ Paul Maxwell: thank you so much for your answer! Can you please show me how I can convert your code into SQL code? Thank you so much! – stats_noob Mar 08 '23 at 02:24
  • Not familiar with Netazza but the above is reasonably "generic" and hopefully will work as is – Paul Maxwell Mar 08 '23 at 02:30
  • Mmm, perhas that isn't corect, ignore CROSS JOIN - sorry - you need a slightly different approach. give me a while – Paul Maxwell Mar 08 '23 at 02:35
  • @ Paul Maxwell: Thank you so much for your updates! If you have time, do you think you could please continue the SQL code? – stats_noob Mar 08 '23 at 03:08
  • The SQL is there already (I removed the cross join, it is now a left join). – Paul Maxwell Mar 08 '23 at 03:11
  • according to this: https://www.ibm.com/docs/en/db2woc?topic=migration-netezza-db2-compatibility DB2 is very similar for Netezza- so I am going to recommend https://dbfiddle.uk where you can create fiddles based on DB2 see https://dbfiddle.uk/TSyxO405 – Paul Maxwell Mar 08 '23 at 03:37
  • @ Paul Maxwell: Thank you so much for this link! – stats_noob Mar 08 '23 at 03:39
  • A related question -I tried this code here and ran it: https://dbfiddle.uk/bOxcqU5z ....nothing comes back (i cant see any rows in the output)... am I doing this correctly? – stats_noob Mar 08 '23 at 03:45
  • last comment here: https://dbfiddle.uk/6v0ebx78 if you have more please ask new questions which others can help you with too. fair? – Paul Maxwell Mar 08 '23 at 03:59
  • BTW: there is a "foible" at dbfiddle if you look at this: https://dbfiddle.uk/VZdf24fO you will see that each insert is in its own text box. Look carefully, uner the [+] there is a little hamburger menu. In that menu the second down option is to "split" that's' how I got you initial fiddle to work - the inserts needed to be "split" (or. don't do individual inserts which is another way to solve the issues). Hopefully this all makes sense. – Paul Maxwell Mar 08 '23 at 04:06
  • and (finally?) If possible can you "accept" the answer? It really helps others to know if a question has an accepted answer. – Paul Maxwell Mar 08 '23 at 04:09
  • @ paul maxwell: you wrote that this was an "Example query with join that lists all years and joins to data when there is a match". Will this also backfill the other variables such as the country_of_birth? – stats_noob Mar 08 '23 at 14:44