I have a table of payments by customer ID and year. The first of many customers looks like this:

 ID    Payment    Year
112          0    2004
112          0    2005
112          0    2006
112       9592    2007
112      12332    2008
112       9234    2011
112       5400    2012
112       7392    2014
112       8321    2015

Note that some years are missing. I need to create 10 new columns, showing the Payments in the previous 10 years, for each row. The resulting table should look like this:

 ID    Payment    Year    T-1    T-2    T-3    T-4    T-5    T-6    T-7    T-8    T-9   T-10
112          0    2004   NULL   NULL   NULL   NULL   NULL   NULL   NULL   NULL   NULL   NULL
112          0    2005      0   NULL   NULL   NULL   NULL   NULL   NULL   NULL   NULL   NULL
112          0    2006      0      0   NULL   NULL   NULL   NULL   NULL   NULL   NULL   NULL
112       9592    2007      0      0      0   NULL   NULL   NULL   NULL   NULL   NULL   NULL
112      12332    2008   9592      0      0      0   NULL   NULL   NULL   NULL   NULL   NULL
112       9234    2011   NULL   NULL  12332   9592      0      0      0   NULL   NULL   NULL
112       5400    2012   9234   NULL   NULL  12332   9592      0      0      0   NULL   NULL
112       7392    2014   NULL   5400   9234   NULL   NULL  12332   9592      0      0      0
112       8321    2015   7392   NULL   5400   9234   NULL   NULL  12332   9592      0      0

(I know this is duplicating data - it is being prepared for a predictive model, in which previous payments (and other info) will be used to predict the current year's payment)

In SQL I would left join the table to itself, joining on ID and Year=(Year-1) etc... but I can't figure out how to do this in R.

I've also thought about using dplyr to group by ID, then mutate the new columns using lag, before ungrouping. But my tables are very large and I think this would be too slow. Ideally I would like to use data.table instead, but can't figure out how.

Any help much appreciated.

rw2
  • Could you replace the given data frame with the structure() output of dput(df), please? – mabreitling Aug 04 '20 at 10:13
  • First: [Fastest way to add rows for missing time steps?](https://stackoverflow.com/questions/10438969/fastest-way-to-add-rows-for-missing-time-steps). Then: [How can I automatically create n lags in a timeseries?](https://stackoverflow.com/a/28056113/1851712) – Henrik Aug 04 '20 at 10:23
  • Henrik - I like the answer in the second link, using shift, but it doesn't group by ID - it just always takes from the row above even if they are different IDs. I'm not sure how the first link is related? – rw2 Aug 04 '20 at 10:58

2 Answers

First, merge the data with every combination of ID and Year, so that the missing years appear as rows with an NA Payment:

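# Build every ID x Year combination and join on both keys,
# so the grid stays correct when there is more than one ID
grid <- CJ(ID = unique(df$ID), Year = min(df$Year):max(df$Year))
dftot <- merge(df, grid, by = c("ID", "Year"), all = TRUE)
setcolorder(dftot, c("Year", "Payment", "ID"))  # column order used in the output below
setorder(dftot, ID, Year)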

    Year Payment  ID
 1: 2004       0 112
 2: 2005       0 112
 3: 2006       0 112
 4: 2007    9592 112
 5: 2008   12332 112
 6: 2009      NA 112
 7: 2010      NA 112
 8: 2011    9234 112
 9: 2012    5400 112
10: 2013      NA 112
11: 2014    7392 112
12: 2015    8321 112

Then create the lagged columns by ID, and keep only the rows with a non-missing Payment:

# shift() returns one lagged vector per offset and pads with NA,
# so IDs with fewer than 10 rows are handled as well
lag_cols <- paste0("T-", 1:10)
dftot[, (lag_cols) := shift(Payment, 1:10), by = ID][!is.na(Payment)]

   Year Payment  ID  T-1  T-2   T-3   T-4  T-5   T-6   T-7  T-8 T-9 T-10
1: 2004       0 112   NA   NA    NA    NA   NA    NA    NA   NA  NA   NA
2: 2005       0 112    0   NA    NA    NA   NA    NA    NA   NA  NA   NA
3: 2006       0 112    0    0    NA    NA   NA    NA    NA   NA  NA   NA
4: 2007    9592 112    0    0     0    NA   NA    NA    NA   NA  NA   NA
5: 2008   12332 112 9592    0     0     0   NA    NA    NA   NA  NA   NA
6: 2011    9234 112   NA   NA 12332  9592    0     0     0   NA  NA   NA
7: 2012    5400 112 9234   NA    NA 12332 9592     0     0    0  NA   NA
8: 2014    7392 112   NA 5400  9234    NA   NA 12332  9592    0   0    0
9: 2015    8321 112 7392   NA  5400  9234   NA    NA 12332 9592   0    0

This should be quite efficient and should handle multiple IDs, since the lags are computed by ID.
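The self-join that the question describes in SQL terms can also be sketched directly in data.table, without adding rows for the missing years. This is only an illustration under assumptions: the toy data (including the made-up ID 113) and the names lookup, prev and lag_cols are not from the original post, and each ID/Year pair is assumed to occur only once.

```r
library(data.table)

# Toy data: ID 112 from the question plus a hypothetical single-row ID 113
df <- data.table(
  ID      = c(rep(112L, 9), 113L),
  Payment = c(0, 0, 0, 9592, 12332, 9234, 5400, 7392, 8321, 100),
  Year    = c(2004:2008, 2011L, 2012L, 2014L, 2015L, 2010L)
)

lag_cols <- paste0("T-", 1:10)
df[, (lag_cols) := NA_real_]  # create the lag columns up front

for (k in 1:10) {
  # Move each payment k years forward so it lines up with the row it feeds
  lookup <- df[, .(ID, Year = Year + k, prev = Payment)]
  # Update join: rows of df with a matching ID/Year receive the earlier payment
  df[lookup, on = .(ID, Year), (lag_cols[k]) := i.prev]
}
```

Each pass is one keyed join, so the cost grows with the number of lags rather than with the number of missing years; whether that beats the grid-and-shift approach on a given table would need to be measured.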


The data

library(data.table)
df <- setDT(read.table(text = "ID    Payment    Year
112          0    2004
112          0    2005
112          0    2006
112       9592    2007
112      12332    2008
112       9234    2011
112       5400    2012
112       7392    2014
112       8321    2015", header = TRUE))
denis
  • I think this almost works, but seems to break when I have ID's who have just a single year of payments. – rw2 Sep 15 '20 at 12:32
  • I get the following error: Error in Payment[1:(.N - i)] : only 0's may be mixed with negative subscripts – rw2 Sep 15 '20 at 13:00
  • Now it is breaking when it reaches any borrower with more than 1, but fewer than 10, rows. If I change the number of rows to spread to 5, then it will break when it reaches a borrower with fewer than 5 rows etc. – rw2 Sep 22 '20 at 09:35
  • could you edit your example ? I ll have a look with the example that reproduce the pb – denis Sep 22 '20 at 14:28

Here is a base R option, based on the same idea as the solution by @denis:

u <- merge(df1,
  data.frame(ID = unique(df1$ID), Year = min(df1$Year):max(df1$Year)),
  by = c("ID", "Year"), all = TRUE
)

subset(cbind(u, `colnames<-`(do.call(
  rbind,
  lapply(
    Reduce(c, c(NA, u$Payment), accumulate = TRUE)[1:nrow(u)],
    function(x) `length<-`(head(rev(x), 10), 10)
  )
), paste0("T-", 1:10))), !is.na(Payment))

which gives

    ID Year Payment  T-1  T-2   T-3   T-4  T-5   T-6   T-7  T-8 T-9 T-10
1  112 2004       0   NA   NA    NA    NA   NA    NA    NA   NA  NA   NA
2  112 2005       0    0   NA    NA    NA   NA    NA    NA   NA  NA   NA
3  112 2006       0    0    0    NA    NA   NA    NA    NA   NA  NA   NA
4  112 2007    9592    0    0     0    NA   NA    NA    NA   NA  NA   NA
5  112 2008   12332 9592    0     0     0   NA    NA    NA   NA  NA   NA
8  112 2011    9234   NA   NA 12332  9592    0     0     0   NA  NA   NA
9  112 2012    5400 9234   NA    NA 12332 9592     0     0    0  NA   NA
11 112 2014    7392   NA 5400  9234    NA   NA 12332  9592    0   0    0
12 112 2015    8321 7392   NA  5400  9234   NA    NA 12332 9592   0    0
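
The snippet above builds the lags from a single running Payment vector, so it appears to assume one ID. A minimal base R sketch of a per-ID variant follows; it looks up each earlier year by an ID/Year key instead of filling the gaps. The helper name lag_by_year and the extra rows for ID 113 are hypothetical and only serve the illustration.

```r
# Toy data: ID 112 from the question plus two made-up rows for ID 113
df1 <- data.frame(
  ID      = c(rep(112L, 9), 113L, 113L),
  Payment = c(0, 0, 0, 9592, 12332, 9234, 5400, 7392, 8321, 100, 250),
  Year    = c(2004:2008, 2011L, 2012L, 2014L, 2015L, 2010L, 2012L)
)

# Hypothetical helper: payment made by the same ID exactly k years earlier
lag_by_year <- function(d, k) {
  key_now  <- paste(d$ID, d$Year)
  key_prev <- paste(d$ID, d$Year - k)
  d$Payment[match(key_prev, key_now)]  # NA where that ID/Year is absent
}

for (k in 1:10) {
  df1[[paste0("T-", k)]] <- lag_by_year(df1, k)
}
```

Because match() returns NA for keys that are not found, missing years and short histories need no special handling.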

Data

> dput(df1)
structure(list(ID = c(112L, 112L, 112L, 112L, 112L, 112L, 112L, 
112L, 112L), Payment = c(0L, 0L, 0L, 9592L, 12332L, 9234L, 5400L,
7392L, 8321L), Year = c(2004L, 2005L, 2006L, 2007L, 2008L, 2011L,
2012L, 2014L, 2015L)), class = "data.frame", row.names = c(NA,
-9L))
ThomasIsCoding