0

I'm very well-versed in SQL, and an absolute novice in R. Unfortunately, due to an update in company policy, we must use Athena to run our SQL queries. Athena is weak, so despite having a complete/correct SQL query, I cannot run it to manipulate my large, insurance-based dataset. I have seen similar posts, but haven't managed to crack my own problem trying to utilize the methodologies provided. Here are the details:

  • After running the SQL block in R (using a connection string), I have a countrywide data block denoted CW_Data in R
  • Each record contains a policy with a multitude of characteristics (columns) such as the Policy_Number, Policy_Effective_Date, Policy_Earned_Premium
  • Athena breaks down when I try add two columns based on the already-existing ones
  • Namely, I want to left join such that I can obtain a new columns for Policy_Prior_Year_Earned_Premium and Policy_Second_Prior_Year_Earned_Premium

Per the above, I know I need to add columns such that, for a given policy, I can find the record where the Policy_Number=Policy_Number and Policy_Effective_Date = Policy_Effective_Date-1 or Policy_Effective_Date-2 years. This is quite simple in SQL, but I cannot get it in R for the life of me.

Here is the (watered-down) left join I attempted in SQL using CTEs that breaks Athena (even if the SQL is run via R):

All_Info as (
Select
       PC.Policy_Number
      ,PC.Policy_Effective_Date
      ,PC.Policy_EP
from Policy_Characteristics as PC

       left join Almost_All_Info as AAI
       on  AAI.Policy_Number         = PC.Policy_Number
       and AAI.Policy_Effective_Date = date_add('year', -1, PC.Policy_Effective_Date)
       
       left join All_Segments as AST
       on  AST.Policy_Number         = PC.Policy_Number
       and AST.Policy_Effective_Date = date_add('year', -2, PC.Policy_Effective_Date)
Group by 
       PC.Policy_Number
      ,PC.Policy_Effective_Date
      ,PC.Policy_EP
FWWIII
  • 11
  • 1
  • 2
    Not familiar with Athena, can you provide context on the relevance of R? eg, can you use any R code including packages? You mention running your SQL code “via R,” could you please add the R code to your question? – zephryl Dec 09 '22 at 13:47
  • 2
    Could you also please add a sample of your data, or else made-up data with the same structure? See [this thread](https://stackoverflow.com/q/5963269/17303805) on making reproducible examples. – zephryl Dec 09 '22 at 13:50

1 Answers1

0

As @zephryl pointed out, examples of data and expected result would be very helpful.

From your description, the R equivalent might look like this:

library(dplyr)
library(lubridate) ## datetime helpers

All_Info <- 
  Policy_Characteristics |>
  select(Policy_Number,
         Policy_Effective_Date, ## make sure this has class "Date"
         Policy_EP
         ) |>
  mutate(one_year_earlier = Policy_Effective_Date + duration(years = -1),
         two_years_earlier = Policy_Effective_Date + duration(years = -2)
         ) |>
  left_join(Almost_All_Info,
            by = c('Policy_Number' = 'Policy_Number',
                   'one_year_earlier' = 'Policy_Effective_Date'
                   )
            ) |>
  left_join(All_Segments,
            by = c('Policy_Number' = 'Policy_Number',
                   'two_years_earlier' = 'Policy_Effective_Date'
                   )
            ) |>
  group_by(Policy_Number,
           Policy_Effective_Date,
           Policy_EP
           )

I_O
  • 4,983
  • 2
  • 2
  • 15