I have this dataset in R:
name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"
my_data = data.frame(name, year, age, gender, country_of_birth, source)
As we can see, some of the people in this dataset have rows with missing years (e.g. John goes from 2011 to 2014):
name year age gender country_of_birth source
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2014 25 male australia ORIGINAL
4 sarah 2010 55 female uk ORIGINAL
5 sarah 2015 60 female uk ORIGINAL
6 peter 2011 61 male mexico ORIGINAL
7 peter 2012 62 male mexico ORIGINAL
8 peter 2013 63 male mexico ORIGINAL
9 peter 2015 65 male mexico ORIGINAL
I have this code that is able to add these missing rows by "interpolating" logical values of the missing rows (e.g. age increases by 1, country_of_birth stays the same, etc.), and records if this row was added later or original:
library(tidyverse)
library(dplyr)
# R Code to Convert into SQL
final = my_data %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
# A tibble: 16 x 6
# Groups: name [3]
name year age gender country_of_birth source
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2012 23 male australia NOT ORIGINAL
My Question: I am trying to learn how I can convert this above code into (Netezza) SQL Code.
To get an idea how to begin, I thought I could use the "dbplyr" library in R to convert my "dplyr" code into SQL:
library(dbplyr)
# attempt 1
remote_df = tbl_lazy(my_data, con = simulate_mysql())
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
# attempt 2
remote_df = tbl_lazy(my_data, con = simulate_mssql())
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
# attempt 3
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
remote_df <- copy_to(con, my_data)
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
# attempt 4
memdb_frame(my_data) %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
However all these attempts give me the same error:
Error in `fill()`:
x `.data` does not have explicit order.
i Please use `arrange()` or `window_order()` to make determinstic.
Run `rlang::last_error()` to see where the error occurred.
Can someone please show me what I am doing wrong and what I can do to convert this R code into SQL code? I had thought that perhaps I could find out which rows are missing for which person, create these rows - and then somehow use JOINS to bring them back to the original dataset in SQL.
Thanks!