This question is an extension based on how to get quick summary of count in data.table.
Similarly, this is a part of feature engineering that summarizes each ID depending on column called Col by looking back with certain time window. The same preprocess will be applied to the testing set. Since the data set is large, data.table based solution may be more preferred.
1. summerize on a variable and value with a look-back window
Training Input:
ID Time Col Count
A 2017-06-05 M 1
A 2017-06-02 M 1
A 2017-06-03 M 1
B 2017-06-02 K 1
B 2017-06-01 M 4
By applying two days looking back, we have:
ID Time Time-2D Col Count
A 2017-06-05 2017-06-03 M 1 #Time-2D by moving time two days back
A 2017-06-02 2017-05-31 M 1
A 2017-06-03 2017-06-01 M 1
B 2017-06-02 2017-05-31 K 1
B 2017-06-01 2017-05-30 M 4
Expected output (count):
ID Time Time-2D Col_M Col_K
A 2017-06-05 2017-06-03 1 0 #from 2017-06-03 to 2017-06-05, for user A, there are 0 (sum(count)) of K and 1 (sum(count)) of M.
A 2017-06-02 2017-05-31 1 0
A 2017-06-03 2017-06-01 2 0 # 2 is because from 06-01 to 06-03, there is two rows in the first table (A 2017-06-02 M 1; A 2017-06-03 M 1) that the count summarization on M is 2.
B 2017-06-02 2017-05-31 0 1
B 2017-06-01 2017-05-30 4 0
2. Calculate ratio
Based on above table, Expected output (ratio):
ID Time Time-2D Col_M Col_K
A 2017-06-05 2017-06-03 1 0 # 1/sum(1+0)
A 2017-06-02 2017-05-31 1 0
A 2017-06-03 2017-06-01 1 0 #2/sum(2+0)
B 2017-06-02 2017-05-31 0 1
B 2017-06-01 2017-05-30 1 0 # 4/sum(4+0)
Above is for processing training data. For testing dataset, if requires to mapping over Col_M, Col_K, meaning, if other value like S appearing in Col, it will be ignored.