
This question is an extension of an earlier one on how to get a quick summary of counts in data.table.

Similarly, this is part of feature engineering that summarizes each ID by the column called Col, looking back over a certain time window. The same preprocessing will be applied to the testing set. Since the data set is large, a data.table-based solution is preferred.

1. Summarize on a variable and value with a look-back window

Training Input:

ID   Time        Col   Count 
A    2017-06-05   M      1
A    2017-06-02   M      1
A    2017-06-03   M      1
B    2017-06-02   K      1
B    2017-06-01   M      4

Applying a two-day look-back gives:

ID   Time         Time-2D      Col   Count
A    2017-06-05   2017-06-03   M     1      # Time-2D is Time shifted two days back
A    2017-06-02   2017-05-31   M     1
A    2017-06-03   2017-06-01   M     1
B    2017-06-02   2017-05-31   K     1
B    2017-06-01   2017-05-30   M     4

Expected output (count):

ID   Time         Time-2D      Col_M   Col_K
A    2017-06-05   2017-06-03   1       0      # from 2017-06-03 to 2017-06-05, user A has sum(Count) of 1 for M and 0 for K
A    2017-06-02   2017-05-31   1       0
A    2017-06-03   2017-06-01   2       0      # from 06-01 to 06-03 there are two rows for A with Col M (A 2017-06-02 M 1; A 2017-06-03 M 1), so the summed Count for M is 2
B    2017-06-02   2017-05-31   0       1
B    2017-06-01   2017-05-30   4       0

2. Calculate ratio

Based on the table above, the expected output (ratio) is:

ID   Time         Time-2D      Col_M   Col_K
A    2017-06-05   2017-06-03   1       0      # 1/(1+0)
A    2017-06-02   2017-05-31   1       0
A    2017-06-03   2017-06-01   1       0      # 2/(2+0)
B    2017-06-02   2017-05-31   0       1
B    2017-06-01   2017-05-30   1       0      # 4/(4+0)

The above is for processing the training data. For the testing set, the mapping must stay restricted to Col_M and Col_K; any other value appearing in Col, such as S, is ignored.
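One possible way to enforce this restriction on the test set (a sketch with hypothetical test data; the `test` table and `keep` vector below are assumptions, not from the post) is to recode `Col` as a factor holding only the training categories before casting:

```r
library(data.table)

## Hypothetical test rows; S never appeared in training
test <- fread("ID   Time        Col   Count
C    2017-06-04   S      2
C    2017-06-04   M      1")

keep <- c("M", "K")                        # categories seen in training
test[, Col := factor(Col, levels = keep)]  # unknown values (S) become NA ...
test <- test[!is.na(Col)]                  # ... and are ignored

## drop = FALSE keeps a column for every level, even the unused K
wide <- dcast(test, ID + Time ~ Col, value.var = "Count",
              fill = 0, drop = FALSE)
```

With this, the test set always yields exactly the M and K columns that the training features have, regardless of what values show up in Col.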

HappyCoding

2 Answers


I think I understand your request. You seem to care about the order of the observations, regardless of whether, for instance, the second observation's Time is prior to the first observation's Time. That doesn't make much sense, but here is a quite efficient data.table solution to achieve it. It basically does a non-equi join by ID, Col, both Time columns, and the row index (which is basically the appearance order). Afterwards, it just uses dcast to convert from long to wide (as in your previous question). Note that the result is ordered by the dates, but I've kept the rowindx variable, so you can reorder it back using setorder. Also, I'll leave the ratio calculation to you, as it is very basic (hint: don't use loops, it is a fully vectorized one-liner).

library(data.table) #v1.10.4+

## Read the data
DT <- fread("ID   Time        Col   Count 
A    2017-06-05   M      1
A    2017-06-02   M      1
A    2017-06-03   M      1
B    2017-06-02   K      1
B    2017-06-01   M      4")

## Prepare the variables we need for the join
DT[, Time := as.IDate(Time)]
DT[, Time_2D := Time - 2L]
DT[, rowindx := .I]

## Non-equi join, sum `Count` by each join
DT2 <- DT[DT, 
          sum(Count), 
          on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D),
          by = .EACHI]

## Fix column names (a known issue)
setnames(DT2, make.unique(names(DT2)))

## Long to wide (You can reorder back using `rowindx` and `setorder` function)
dcast(DT2, ID + Time + Time.1 + rowindx ~ Col, value.var = "V1", fill = 0)
#    ID       Time     Time.1 rowindx K M
# 1:  A 2017-06-02 2017-05-31       2 0 1
# 2:  A 2017-06-03 2017-06-01       3 0 2
# 3:  A 2017-06-05 2017-06-03       1 0 1
# 4:  B 2017-06-01 2017-05-30       5 0 4
# 5:  B 2017-06-02 2017-05-31       4 1 0
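The ratio step left as an exercise can indeed be a single vectorized call. A sketch (assuming the dcast result is stored as `res` with the count columns named `K` and `M`; the table below just reproduces the output above):

```r
library(data.table)

## The wide table produced by the dcast above (rowindx/Time cols omitted)
res <- data.table(ID = c("A", "A", "A", "B", "B"),
                  K  = c(0, 0, 0, 0, 1),
                  M  = c(1, 2, 1, 4, 0))

## Divide each count by the row total in one `:=` call;
## both ratios use the original K and M values
res[, c("Col_K", "Col_M") := .(K / (K + M), M / (K + M))]
```

Note that a row with `K + M == 0` would produce `NaN`, so guard against that if such rows can occur.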
David Arenburg
  • Have a quick question: do you know how to take a column-wise sum for a list of columns? i.e., in ```DT2 <- DT[DT, sum(Count), on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D), by = .EACHI]``` we have the ```Count``` column as well as other columns. – HappyCoding Jun 28 '17 at 03:12
  • Yes, Like you would always do in data.table. For instance, if you also have `Count2` and `Count3` cols, you could simply do: `DT2 <- DT[DT, lapply(.SD, sum), .SDcols = c("Count", "Count2", "Count3"),on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D), by = .EACHI]` – David Arenburg Jun 28 '17 at 03:22
  • Thanks, it works. I tried previously with .SDcols; however, it had a conflict with keyby. Is there any difference between ```by=.EACHI``` and ```keyby=.EACHI```? – HappyCoding Jun 28 '17 at 03:30
  • I'm not sure what you mean by "*conflict*". The difference between `by` and `keyby` is that `by` leaves the result unordered, while `keyby` adds a key (according to the specified columns) and, as a result, orders the whole data by those columns. – David Arenburg Jun 28 '17 at 05:31

You can try

dt <- fread("ID   Time          Time-2D   Col   Count
        A    2017-06-05   2017-06-03   M      1   
        A    2017-06-02   2017-05-31   M      1
        A    2017-06-03   2017-06-01   M      1
        B    2017-06-02   2017-05-31   K      1
        B    2017-06-01   2017-05-30   M      4") 
dt1 <- dcast(dt, ID+Time+`Time-2D`~Col, value.var = c("Count"))
dt1[, K := ifelse(is.na(K), 0, K)]
dt1[, M := ifelse(is.na(M), 0, M)]

   ID       Time    Time-2D K M
1:  A 2017-06-02 2017-05-31 0 1
2:  A 2017-06-03 2017-06-01 0 1
3:  A 2017-06-05 2017-06-03 0 1
4:  B 2017-06-01 2017-05-30 0 4
5:  B 2017-06-02 2017-05-31 1 0

dt1[, Col_K := K/(K+M)]
dt1[, Col_M := M/(K+M)]

    ID       Time    Time-2D K M Col_K Col_M
1:  A 2017-06-02 2017-05-31 0 1     0     1
2:  A 2017-06-03 2017-06-01 0 1     0     1
3:  A 2017-06-05 2017-06-03 0 1     0     1
4:  B 2017-06-01 2017-05-30 0 4     0     1
5:  B 2017-06-02 2017-05-31 1 0     1     0

You can combine the last two lines into a single `:=` call, something like

dt1[, `:=`(Col_K = K/(K+M), Col_M = M/(K+M))]
Peter Chen