
This question is an extension of an earlier one on how to get a quick summary of counts in data.table.

Similarly, this is part of feature engineering that summarizes each ID by the column called Col, looking back over a certain time window. The same preprocessing will be applied to the testing set. Since the data set is large, a data.table-based solution is preferred.

1. Summarize on a variable and value with a look-back window

Training Input:

ID   Time        Col   Count 
A    2017-06-05   M      1
A    2017-06-02   M      1
A    2017-06-03   M      1
B    2017-06-02   K      1
B    2017-06-01   M      4

Applying a two-day look-back gives:

ID   Time         Time-2D      Col   Count
A    2017-06-05   2017-06-03   M     1      # Time-2D is Time shifted two days back
A    2017-06-02   2017-05-31   M     1
A    2017-06-03   2017-06-01   M     1
B    2017-06-02   2017-05-31   K     1
B    2017-06-01   2017-05-30   M     4

Expected output (count):

ID   Time         Time-2D      Col_M   Col_K
A    2017-06-05   2017-06-03   1       0      # from 2017-06-03 to 2017-06-05, user A has sum(Count) of 1 for M and 0 for K
A    2017-06-02   2017-05-31   1       0
A    2017-06-03   2017-06-01   2       0      # from 06-01 to 06-03 there are two rows for A with Col M (A 2017-06-02 M 1; A 2017-06-03 M 1), so the summed Count for M is 2
B    2017-06-02   2017-05-31   0       1
B    2017-06-01   2017-05-30   4       0

2. Calculate ratio

Based on the table above, the expected output (ratio) is:

ID   Time         Time-2D      Col_M   Col_K
A    2017-06-05   2017-06-03   1       0      # 1/(1+0)
A    2017-06-02   2017-05-31   1       0
A    2017-06-03   2017-06-01   1       0      # 2/(2+0)
B    2017-06-02   2017-05-31   0       1
B    2017-06-01   2017-05-30   1       0      # 4/(4+0)

The above is for processing the training data. For the testing set, the mapping must stay restricted to Col_M and Col_K; any other value appearing in Col, such as S, is ignored.
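One possible way to enforce this restriction on the test set (a sketch with hypothetical test data; the `test` table and `keep` vector below are assumptions, not from the post) is to recode `Col` as a factor holding only the training categories before casting:

```r
library(data.table)

## Hypothetical test rows; S never appeared in training
test <- fread("ID   Time        Col   Count
C    2017-06-04   S      2
C    2017-06-04   M      1")

keep <- c("M", "K")                        # categories seen in training
test[, Col := factor(Col, levels = keep)]  # unknown values (S) become NA ...
test <- test[!is.na(Col)]                  # ... and are ignored

## drop = FALSE keeps a column for every level, even the unused K
wide <- dcast(test, ID + Time ~ Col, value.var = "Count",
              fill = 0, drop = FALSE)
```

With this, the test set always yields exactly the M and K columns that the training features have, regardless of what values show up in Col.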

HappyCoding

2 Answers


I think I understand your request. You seem to care about the order of the observations, regardless of whether, for instance, the second observation's Time is prior to the first observation's Time. That doesn't make much sense, but here is a quite efficient data.table solution to achieve it. It basically does a non-equi join by ID, Col, both Time columns, and the row index (which is basically the appearance order). Afterwards, it just uses dcast to convert from long to wide (as in your previous question). Note that the result is ordered by the dates, but I've kept the rowindx variable, so you can reorder it back using setorder. Also, I'll leave the ratio calculation to you, as it is very basic (hint: don't use loops, it is a fully vectorized one-liner).

library(data.table) #v1.10.4+

## Read the data
DT <- fread("ID   Time        Col   Count 
A    2017-06-05   M      1
A    2017-06-02   M      1
A    2017-06-03   M      1
B    2017-06-02   K      1
B    2017-06-01   M      4")

## Prepare the variables we need for the join
DT[, Time := as.IDate(Time)]
DT[, Time_2D := Time - 2L]
DT[, rowindx := .I]

## Non-equi join, sum `Count` by each join
DT2 <- DT[DT, 
          sum(Count), 
          on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D),
          by = .EACHI]

## Fix column names (a known issue)
setnames(DT2, make.unique(names(DT2)))

## Long to wide (You can reorder back using `rowindx` and `setorder` function)
dcast(DT2, ID + Time + Time.1 + rowindx ~ Col, value.var = "V1", fill = 0)
#    ID       Time     Time.1 rowindx K M
# 1:  A 2017-06-02 2017-05-31       2 0 1
# 2:  A 2017-06-03 2017-06-01       3 0 2
# 3:  A 2017-06-05 2017-06-03       1 0 1
# 4:  B 2017-06-01 2017-05-30       5 0 4
# 5:  B 2017-06-02 2017-05-31       4 1 0
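The ratio step left as an exercise can indeed be a single vectorized call. A sketch (assuming the dcast result is stored as `res` with the count columns named `K` and `M`; the table below just reproduces the output above):

```r
library(data.table)

## The wide table produced by the dcast above (rowindx/Time cols omitted)
res <- data.table(ID = c("A", "A", "A", "B", "B"),
                  K  = c(0, 0, 0, 0, 1),
                  M  = c(1, 2, 1, 4, 0))

## Divide each count by the row total in one `:=` call;
## both ratios use the original K and M values
res[, c("Col_K", "Col_M") := .(K / (K + M), M / (K + M))]
```

Note that a row with `K + M == 0` would produce `NaN`, so guard against that if such rows can occur.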
David Arenburg
  • Have a quick question: do you know how to take a column-wise sum for a list of columns? i.e., in ```DT2 <- DT[DT, sum(Count), on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D), by = .EACHI]``` we have the ```Count``` column as well as other columns. – HappyCoding Jun 28 '17 at 03:12
  • Yes, Like you would always do in data.table. For instance, if you also have `Count2` and `Count3` cols, you could simply do: `DT2 <- DT[DT, lapply(.SD, sum), .SDcols = c("Count", "Count2", "Count3"),on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D), by = .EACHI]` – David Arenburg Jun 28 '17 at 03:22
  • Thanks, it works. I tried previously with .SDcols; however, it had a conflict with keyby. Is there any difference between ```by=.EACHI``` and ```keyby=.EACHI```? – HappyCoding Jun 28 '17 at 03:30
  • I'm not sure what you mean by "*conflict*". The difference between `by` and `keyby` is that `by` leaves the result unordered, while `keyby` adds a key (according to the specified columns) and, as a result, orders the whole data by those columns. – David Arenburg Jun 28 '17 at 05:31

You can try

dt <- fread("ID   Time          Time-2D   Col   Count
        A    2017-06-05   2017-06-03   M      1   
        A    2017-06-02   2017-05-31   M      1
        A    2017-06-03   2017-06-01   M      1
        B    2017-06-02   2017-05-31   K      1
        B    2017-06-01   2017-05-30   M      4") 
dt1 <- dcast(dt, ID+Time+`Time-2D`~Col, value.var = c("Count"))
dt1[, K := ifelse(is.na(K), 0, K)]
dt1[, M := ifelse(is.na(M), 0, M)]

   ID       Time    Time-2D K M
1:  A 2017-06-02 2017-05-31 0 1
2:  A 2017-06-03 2017-06-01 0 1
3:  A 2017-06-05 2017-06-03 0 1
4:  B 2017-06-01 2017-05-30 0 4
5:  B 2017-06-02 2017-05-31 1 0

dt1[, Col_K := K/(K+M)]
dt1[, Col_M := M/(K+M)]

    ID       Time    Time-2D K M Col_K Col_M
1:  A 2017-06-02 2017-05-31 0 1     0     1
2:  A 2017-06-03 2017-06-01 0 1     0     1
3:  A 2017-06-05 2017-06-03 0 1     0     1
4:  B 2017-06-01 2017-05-30 0 4     0     1
5:  B 2017-06-02 2017-05-31 1 0     1     0

You can combine the last two lines into a single `:=` call, something like

dt1[, `:=`(Col_K = K/(K+M), Col_M = M/(K+M))]
Peter Chen