2

General Question on conversion between SQL and R

I am starting with this sample dataset:

start = structure(list(id = c(111L, 111L, 111L, 111L, 222L, 222L, 222L
), year = c(2010L, 2011L, 2012L, 2013L, 2018L, 2019L, 2020L),
col2 = c("A", "BB", "A", "C", "D", "EE", "F"), col3 = c(242L,
213L, 233L, 455L, 11L, 444L, 123L), col4 = c(1213L, 5959L,
9988L, 4242L, 333L, 1232L, 98L)), class = "data.frame", row.names = c(NA,
-7L))

This is what I want to do in R:

 library(dplyr)

end <- start %>%
mutate(year_end = lead(year),
     col2_end = lead(col2),
     col3_end = lead(col3),
     col4_end = lead(col4)) %>%
mutate_at(vars(ends_with("_end")), ~ifelse(is.na(.), "END", .)) %>%
rename(year_start = year,
     col2_start = col2,
     col3_start = col3,
     col4_start = col4)

Now I try using SQL:

# my attempt to solve the problem
CREATE TABLE end AS
SELECT
id,
year AS year_start,
LEAD(year) OVER (PARTITION BY id ORDER BY year) AS year_end,
col2 AS col2_start,
LEAD(col2) OVER (PARTITION BY id ORDER BY year) AS col2_end,
col3 AS col3_start,
LEAD(col3) OVER (PARTITION BY id ORDER BY year) AS col3_end,
col4 AS col4_start,
LEAD(col4) OVER (PARTITION BY id ORDER BY year) AS col4_end
FROM start;

Is this code doing the same thing as the R?

In the past, I have normally done this kind of in R, but I have my data on an SQL Server and am trying to optimize this code for larger datasets.

I think the code is correct, but I wanted to have a second set of eyes inspect this code for any possible problems. Anyone see any issues?

Quinten
  • 35,235
  • 5
  • 20
  • 53
  • are you intentionally corrupting your integers? all of your `_end` variables are strings, which defeats the implied assumptions of variable class. (_Yes_, this can be reversed, but (1) not while retaining your special `"END"` indicator, and (2) it's safe _here_ but there are conditions where it will not be safe. It's far better to be explicit and safe with your data conversions.) – r2evans Apr 02 '23 at 16:49

1 Answers1

2

Are they doing the same thing? No. A few reasons:

  1. In your SQL code, you have partition by id, and the equivalent in dplyr is group_by(id), which you have omitted.

  2. Your data has integers in it, but you are silently converting (corrupting) them to character by using base::ifelse(., ., "END"). It is bad-practice to allow R to silently cast to a different class, especially if downstream expects things to be integers. If you expect all of your fields to be strings, then I suggest you explicitly convert them to strings first.

    For this, my code below will retain the integer class until the end, and you can optionally choose to change them to strings with the last mutate(..).

    Unless you know you strictly need the "END" string literal, I suggest it might be better to either use NA or some other number-like sentinel value, e.g., Inf. It depends heavily on how you're using it.

  3. In your R code, you replace nulls (NAs) with "END", and you don't attempt this in SQL. The right SQL verb is COALESCE after CASTing the data to varchar (or char or ...).

FYI, mutate_at has been superseded for a while, I recommend you shift to across.

Below are the updated code blocks.

dplyr

start %>%
  group_by(id) %>%
  mutate(across(everything(), ~ lead(.), .names = "{.col}_end")) %>%
  ungroup() %>%
  rename_with(.fn = ~ paste0(., "_start"), .cols = -c(id, ends_with("_end")))
# # A tibble: 7 × 9
#      id year_start col2_start col3_start col4_start year_end col2_end col3_end col4_end
#   <int>      <int> <chr>           <int>      <int>    <int> <chr>       <int>    <int>
# 1   111       2010 A                 242       1213     2011 BB            213     5959
# 2   111       2011 BB                213       5959     2012 A             233     9988
# 3   111       2012 A                 233       9988     2013 C             455     4242
# 4   111       2013 C                 455       4242       NA <NA>           NA       NA
# 5   222       2018 D                  11        333     2019 EE            444     1232
# 6   222       2019 EE                444       1232     2020 F             123       98
# 7   222       2020 F                 123         98       NA <NA>           NA       NA

If you really need "END" to indicate end-of-group (instead of what appears to be a clear NA here), then add to that:

... %>%
  mutate(across(matches("_(start|end)$"), ~ coalesce(as.character(.), "END")))
# # A tibble: 7 × 9
#      id year_start col2_start col3_start col4_start year_end col2_end col3_end col4_end
#   <int> <chr>      <chr>      <chr>      <chr>      <chr>    <chr>    <chr>    <chr>   
# 1   111 2010       A          242        1213       2011     BB       213      5959    
# 2   111 2011       BB         213        5959       2012     A        233      9988    
# 3   111 2012       A          233        9988       2013     C        455      4242    
# 4   111 2013       C          455        4242       END      END      END      END     
# 5   222 2018       D          11         333        2019     EE       444      1232    
# 6   222 2019       EE         444        1232       2020     F        123      98      
# 7   222 2020       F          123        98         END      END      END      END     

SQL

Your initial code produces the same as the first code block above (preserving integer class). I'll add tibble at the end just to get its column-class hints, not that it adds anything else to this discussion:

sqldf::sqldf("
SELECT
  id,
  year AS year_start,
  LEAD(year) OVER (PARTITION BY id ORDER BY year) AS year_end,
  col2 AS col2_start,
  LEAD(col2) OVER (PARTITION BY id ORDER BY year) AS col2_end,
  col3 AS col3_start,
  LEAD(col3) OVER (PARTITION BY id ORDER BY year) AS col3_end,
  col4 AS col4_start,
  LEAD(col4) OVER (PARTITION BY id ORDER BY year) AS col4_end
FROM start;") %>%
  tibble()
# # A tibble: 7 × 9
#      id year_start year_end col2_start col2_end col3_start col3_end col4_start col4_end
#   <int>      <int>    <int> <chr>      <chr>         <int>    <int>      <int>    <int>
# 1   111       2010     2011 A          BB              242      213       1213     5959
# 2   111       2011     2012 BB         A               213      233       5959     9988
# 3   111       2012     2013 A          C               233      455       9988     4242
# 4   111       2013       NA C          <NA>            455       NA       4242       NA
# 5   222       2018     2019 D          EE               11      444        333     1232
# 6   222       2019     2020 EE         F               444      123       1232       98
# 7   222       2020       NA F          <NA>            123       NA         98       NA

If you need SQL to also change your numbers to strings and use "END", then

sqldf::sqldf("SELECT
id,
CAST(year as varchar(8)) AS year_start,
COALESCE(CAST(LEAD(year) OVER (PARTITION BY id ORDER BY year) as VARCHAR(9)), 'END') AS year_end,
CAST(col2 as varchar(8)) AS col2_start,
COALESCE(CAST(LEAD(col2) OVER (PARTITION BY id ORDER BY year) as VARCHAR(9)), 'END') AS col2_end,
CAST(col3 as varchar(8)) AS col3_start,
COALESCE(CAST(LEAD(col3) OVER (PARTITION BY id ORDER BY year) as VARCHAR(9)), 'END') AS col3_end,
CAST(col4 as varchar(8)) AS col4_start,
COALESCE(CAST(LEAD(col4) OVER (PARTITION BY id ORDER BY year) as VARCHAR(9)), 'END') AS col4_end
FROM start;") %>%
  tibble()
# # A tibble: 7 × 9
#      id year_start year_end col2_start col2_end col3_start col3_end col4_start col4_end
#   <int> <chr>      <chr>    <chr>      <chr>    <chr>      <chr>    <chr>      <chr>   
# 1   111       2010 2011     A          BB              242 213            1213 5959    
# 2   111       2011 2012     BB         A               213 233            5959 9988    
# 3   111       2012 2013     A          C               233 455            9988 4242    
# 4   111       2013 END      C          END             455 END            4242 END     
# 5   222       2018 2019     D          EE               11 444             333 1232    
# 6   222       2019 2020     EE         F               444 123            1232 98      
# 7   222       2020 END      F          END             123 END              98 END     

If you haven't inferred so far, I am a stickler for consistency of variable class: too many times I've made assumptions about the class of an object and been stymied when R silently cast the variable to string or such. ifelse is a common culprit for this, contrast ifelse(c(T,T), 1, "A") and ifelse(c(T,F), 1, "A"); it is also not class-safe in other ways.

r2evans
  • 141,215
  • 6
  • 77
  • 149