This is a feature-engineering step that summarizes each ID by counting the values that appear in a column called Col. The same preprocessing will be applied to the testing set. Since the data set is large, a data.table-based solution is preferred.
Training Input:
ID Col
A M
A M
A M
B K
B M
Expected output for above training input:
ID Col_M Col_K
A 3 0 # A has 3 M in Col and 0 K in Col
B 1 1
The above is for processing the training data. For the testing dataset, the output must map onto the same columns, Col_M and Col_K. That is, if another value such as S appears in Col, it is ignored.
Testing Input:
ID Col
C M
C S
Expected output for above testing input:
ID Col_M Col_K
C 1 0 # C has 1 M in Col and 0 K in Col. The S value is ignored
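
For reference, the behaviour described above can be sketched in data.table as follows. This is only a minimal sketch, not a tuned solution: the helper name count_by_level and the variable lv are made up for illustration, and it assumes the columns are literally named ID and Col. The set of values is taken from the training data only, so an unseen value such as S in the testing data matches no column and is dropped.

```r
library(data.table)

# Hypothetical helper: for each ID, count how often each value in `lv`
# occurs in Col. Values of Col that are not in `lv` match no column,
# so they are silently ignored.
count_by_level <- function(dt, lv) {
  out <- dt[, lapply(lv, function(v) sum(Col == v, na.rm = TRUE)), by = ID]
  setnames(out, c("ID", paste0("Col_", lv)))
  out[]
}

train <- data.table(ID  = c("A", "A", "A", "B", "B"),
                    Col = c("M", "M", "M", "K", "M"))
test  <- data.table(ID  = c("C", "C"),
                    Col = c("M", "S"))

# Learn the set of values from the training data only
# (order of first appearance: "M", then "K").
lv <- unique(train$Col)

train_wide <- count_by_level(train, lv)
test_wide  <- count_by_level(test, lv)

print(train_wide)
print(test_wide)
```

The sketch scans Col once per distinct training value, which is fine when there are only a few values. With many distinct values, counting with .N by ID and Col and then reshaping with dcast would likely scale better, at the cost of extra handling for values that are missing from the testing data.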