1

So I've got a data frame like below:

+-----------+---------------+-------+------------+  
| Policy_NO | Creation_Date | Limit | Limit_Date |  
+-----------+---------------+-------+------------+  
| A00001    | 8/31/2015     |  1000 | 8/31/2015  |  
| A00001    | 8/31/2015     |  2000 | 9/30/2015  |  
| A00001    | 8/31/2015     |  5000 | 10/22/2015 |  
| A00001    | 8/31/2015     |   500 | 11/17/2015 |  
| A00003    | 9/21/2015     |  3000 | 1/1/2016   |  
+-----------+---------------+-------+------------+

And what I want is to have an 'End_Limit_Date' which should either be the date of the next limit with the same policy number -1 or exactly one year from the Creation date.

For example, the table above should be:

+------------------------------------------------+----------------+
|  Policy_NO  Creation_Date  Limit  Limit_Date   | End_Limit_Date |
+------------------------------------------------+----------------+
| A00001     8/31/2015       1000  8/31/2015     | 9/29/2015      |
| A00001     8/31/2015       2000  9/30/2015     | 10/21/2015     |
| A00001     8/31/2015       5000  10/22/2015    | 11/16/2015     |
| A00001     8/31/2015        500  11/17/2015    | 8/31/2016      |
| A00003     9/21/2015       3000  1/1/2016      | 9/21/2016      |
+------------------------------------------------+----------------+

Now in sql server, this can be achieved by using an outer apply like below:

SELECT  t.Policy_NO,
        t.Creation_Date,
        t.Limit,
        t.Limit_Date,
        End_Limit_Date = ISNULL(
                            DATEADD(DAY, -1, t2.Limit_Date), 
                            DATEADD(YEAR, 1, t.Creation_Date))
FROM    dbo.T 
        OUTER APPLY
        (   SELECT  TOP 1 t2.Limit_Date
            FROM    dbo.T AS t2
            WHERE   t2.Policy_NO = t.Policy_NO
            AND     t2.Limit_Date > t.Limit_Date
            ORDER BY t2.Limit_Date
        ) AS t2;

but I was wondering if there's a way to do this in R with data frames? I've looked into using the sqldf package (https://github.com/ggrothendieck/sqldf#FAQ) , and I don't think it's supported. I do know that R itself has several apply functions (https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/), and I was wondering if I can achieve the same result using this?

Input data:

dput(df)
structure(list(Policy_NO = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("A00001", 
"A00003"), class = "factor"), Creation_Date = structure(c(16678, 
16678, 16678, 16678, 16699), class = "Date"), Limit = c(1000L, 
2000L, 5000L, 500L, 3000L), Limit_Date = structure(c(16678, 16708, 
16730, 16756, 16801), class = "Date")), .Names = c("Policy_NO", 
"Creation_Date", "Limit", "Limit_Date"), row.names = c(NA, -5L
), class = "data.frame")
Gopala
  • 10,363
  • 7
  • 45
  • 77
Farellia
  • 187
  • 1
  • 2
  • 14
  • 1
    Please provide reproducible data: http://stackoverflow.com/a/5963610/1412059 You need a split-apply-combine function. I would use package data.table for this. – Roland Jan 12 '16 at 15:57
  • 1
    I think the fact that R and SQL both have `apply` functions is just a coincidence; in my experience of using both SQL's `CROSS APPLY` / `OUTER APPLY` and R's `*apply` family, the functionality provided by each are just fundamentally different. – nrussell Jan 12 '16 at 16:02

2 Answers2

4

The default backend of sqldf is SQLite and SQLite is not very good for processing dates because it has no date class -- it does have date functions but these are not standard SQL and there are other problems; however, you can get close to your SQL statement by using the H2 backend of sqldf and a self-join. Note that LIMIT is a keyword and must be escaped. Also note that the first time you use the H2 package in a session it will load rJava so the first H2 query in any session will be slow.

library(sqldf)
library(RH2)

sql <- "SELECT 
          t.Policy_NO,
          t.Creation_Date,
          t.`Limit`,
          t.Limit_Date,
          ISNULL( DATEADD('DAY', -1, MIN(t2.Limit_Date)), 
                  DATEADD('YEAR', 1, t.Creation_Date)) End_Limit_Date
        FROM df t
        LEFT JOIN df t2 ON t2.Policy_NO = t.Policy_NO
                        AND t2.Limit_Date > t.Limit_Date
        GROUP BY t.Policy_NO, t.Creation_Date, t.`Limit`, t.Limit_Date
        ORDER BY 1, 2, 4
"

sqldf(sql)

giving:

> sqldf(sql)
  Policy_NO Creation_Date Limit Limit_Date End_Limit_Date
1    A00001    2015-08-31  1000 2015-08-31     2015-09-29
2    A00001    2015-08-31  2000 2015-09-30     2015-10-21
3    A00001    2015-08-31  5000 2015-10-22     2015-11-16
4    A00001    2015-08-31   500 2015-11-17     2016-08-31
5    A00003    2015-09-21  3000 2016-01-01     2016-09-21

Note: If you do want to use the default SQLite backend then be sure that RH2 is not loaded and try this. Note that SQLite functions assume that date numbers count from an origin 2440588 days earlier than the UNIX Epoch used in R's "Date" class, hence the translation by this amount. The date function produces the date of an SQLite date number as a character string and the julianday function is its inverse (except that it measures from noon so we have to use an offset of 0.5). We use thename__class method of sqldf to cause a translation back to "Date" class for those variables whose name ends in __Date (two underscores).

library(sqldf)


sql <- 'SELECT 
          t.Policy_NO,
          t.Creation_Date Creation_Date__Date,
          t."Limit",
          t.Limit_Date Limit_Date__Date,
          IFNULL(MIN(t2.Limit_Date) - 1, 
                 julianday(date(t.Creation_Date + 2440588, "+1 year")) - 2440588 + 0.5
                ) End_Limit_Date__Date
        FROM    df t
        LEFT JOIN df t2 ON t2.Policy_NO = t.Policy_NO
                        AND t2.Limit_Date > t.Limit_Date
        GROUP BY t.rowid
        ORDER by t.rowid'

sqldf(sql, method = "name__class")
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
3

Input data:

dput(df)
structure(list(Policy_NO = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("A00001", 
"A00003"), class = "factor"), Creation_Date = structure(c(16678, 
16678, 16678, 16678, 16699), class = "Date"), Limit = c(1000L, 
2000L, 5000L, 500L, 3000L), Limit_Date = structure(c(16678, 16708, 
16730, 16756, 16801), class = "Date")), .Names = c("Policy_NO", 
"Creation_Date", "Limit", "Limit_Date"), row.names = c(NA, -5L
), class = "data.frame")

Using library dplyr, you can try something like this:

library(dplyr)
df %>% 
  arrange(Policy_NO, Limit_Date) %>% 
  group_by(Policy_NO) %>% 
  mutate(End_Limit_Date = as.Date(ifelse(is.na(lead(Limit_Date)), 
                                         Creation_Date + 366, 
                                         lead(Limit_Date) - 1)))

Output is as follows:

Source: local data frame [5 x 5]
Groups: Policy_NO [2]

  Policy_NO Creation_Date Limit Limit_Date End_Limit_Date
     (fctr)        (date) (int)     (date)         (date)
1    A00001    2015-08-31  1000 2015-08-31     2015-09-29
2    A00001    2015-08-31  2000 2015-09-30     2015-10-21
3    A00001    2015-08-31  5000 2015-10-22     2015-11-16
4    A00001    2015-08-31   500 2015-11-17     2016-08-31
5    A00003    2015-09-21  3000 2016-01-01     2016-09-21

Note, you need to convert the dates to proper date format as follows:

df$Creation_Date <- as.Date(df$Creation_Date, format = '%m/%d/%Y')
df$Limit_Date <- as.Date(df$Limit_Date, format = '%m/%d/%Y')
joran
  • 169,992
  • 32
  • 429
  • 468
Gopala
  • 10,363
  • 7
  • 45
  • 77
  • 1
    If you went to the trouble of recreating the example data, can you please just add the `dput` output to the question body? – Roland Jan 12 '16 at 16:04