2

I have a dataframe in a database with a date dimension that I pulled using RPostgres. Each 'date' is in the format 'YYYY-MM-DD. I'd like to add a new column of dates (labeled 'lookback_date') that are one year earlier than the initial dates.

To be clear, if an observation's 'date' were '2000-01-01', I'd want to add a new 'lookback_date' of '1999-01-01' to that observation. Unfortunately, I can't figure out how to do that. Normally, I'd use Lubridate, but, from what I can tell, it doesn't work with dbplyr. Here's a streamlined version of my code so far. Everything in my actual code works fine up until the mutate function.

# Packages
library(dbplyr)
library(RPostgres)

# Connect to db 
drv <- dbDriver("Postgres")

# Setup connect to db
conn <- dbConnect(drv,
                  dbname = etc,)

# Define table to use in db
table <- tbl(conn, in_schema("xyz", "abc"))

#Select columns and filter
base_data <- table %>%
  #Filter for pertinent data
  filter(date > as.Date("2018-01-01") & date <= as.Date("2020-01-01"))

modified_data <- base_data %>%
mutate(lookback_date = date - 365)

Is there another way I can create this new column of dates?

Thanks!

  • Are you looking for a way to subtract one year from a date? Or are you looking for a way to subtract one year from a date that `dbplyr` will convert to SQL and execute server-side? The latter is why I assume you do not want to use `lubridate`. –  Feb 12 '21 at 22:02
  • @Adam I'd like to remove one year from the dates. Is there a way to do that with dbplyr? Printing 'modified_data' after running that mutate function throws an error. I don't have that problem when I use that code on local data frames. Hopefully I'm using the right terminology here. – NoCaresBear Feb 12 '21 at 22:27

2 Answers2

2

You are correct that lubridate and dbplyr to not play nicely together (right now). As a result of this, I do most of my dbplyr date manipulation using fragments of sql.

Based on this answer and this site, the postgresql syntax to add/subtract time from a date is:

SELECT old_date + INTERVAL '1 day' AS new_date;

Based on this I would try the following:

output = base_data %>% mutate(lookback_date = date - sql("INTERVAL '1 year'"))

When I do this with a simulated connection, it produces the correct syntax:

library(dplyr)
library(dbplyr)

df = data.frame(my_num = c(1,2,3), my_dates = as.Date(c('2000-01-01','2000-02-02','2000-03-03')))
df = tbl_lazy(df, con = simulate_postgres())

output = df %>% mutate(new_date = my_dates - sql("INTERVAL '1 year'"))

show_query(output)
# <SQL>
# SELECT `my_num`, `my_dates`, `my_dates` - INTERVAL '1 year' AS `new_date`
# FROM `df`

UPDATE: From comment, you first want to convert from date-time to date.

It appears that dbplyr does support the translation of as.Date to PostgreSQL (as.Date is part of base R, not part of lubridate). Hence you can use the following to cast (convert) a column to date:

library(dplyr)
library(dbplyr)

df = data.frame(my_str = c('2000-01-01','2000-02-02','2000-03-03'))
df = tbl_lazy(df, con = simulate_postgres())

output = df %>% mutate(my_date = as.Date(my_str))

show_query(output)
# <SQL>
# SELECT `my_str`, CAST(`my_str` AS DATE) AS `my_date`
# FROM `df`

It also appears that PostgreSQL does not allow you to add an interval of one year. One alternative to this is to extract the year, month, and day from the date, add one to the year and then recombine.

Following these two references (postgre date references and date_part fuction) and this answer, you probablywant something like the following:

output = df %>%
  mutate(the_year = DATE_PART('year', my_date),
         the_month = DATE_PART('month', my_date),
         the_day = DATE_PART('day', my_date)) %>%
  mutate(new_date = MAKE_DATE(the_year + 1, the_month, the_day)
Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • Thanks for your response! When I ran the mutate code, I received the following error: "Error: Failed to fetch row: ERROR: Interval values with month or year parts are not supported" (Could this be a Redshift issue?). When I changed '1 year' to '365 days', it seemed to fix the issue, but the date calculation doesn't seem to play well with leap years. For instance, an observation dated 2020-09-08" receives a "new_date" of "2019-09-09." Is there any way to work around that? Thank you for all your help! @Simon.S.A. EDIT: The new dates are in "dttm". How do I make them "dates" without lubridate? – NoCaresBear Feb 15 '21 at 17:20
  • Glad you found it helpful. I have expanded the answer to address your comment.You may still get some errors - as I can not replicate your R and SQL environment, there is no way to ensure my answer will work if you copy-and-paste it into your environment. But hopefully this is a strong starting point to adapt. – Simon.S.A. Feb 15 '21 at 20:26
1

You can use strings to pull the year subtract 1 and concatenate it back with the date and month. I am not sure if this will translate to sql though! This will also prevent leap years from messing up the days.

base_data %>%
  mutate(lookback_date = as.Date(paste0((as.numeric(substr(date,1,4)))-1,substr(date,5,10)),format="%Y-%m-%d"))

zimia
  • 930
  • 3
  • 16
  • Thanks for your response! When I attempt to glimpse() modified_data, I get the following error: "Error in as.Date(paste0((as.numeric(substr(service_date, 1, 4))) - 1, : unused argument (format = "%Y-%m-%d") " Have any idea why that might be? Thanks for your help! – NoCaresBear Feb 12 '21 at 23:49
  • hmm I don't have that issue. `Rows: 3 Columns: 2 $ date 2020-01-01, 1999-03-21, 1876-02-01 $ lookback_date 2019-01-01, 1998-03-21, 1875-02-01` . However you don't actually need to specify the format as the format is already in ISO-8601 so perhaps just remove the format argument entirely and give it a shot – zimia Feb 13 '21 at 00:26