1

This is a part of feature engineering that summarizes each ID depending on column called Col. The same preprocess will be applied to the testing set. Since the data set is large, data.table based solution may be more preferred.

Training Input:

ID   Col
A    M
A    M
A    M
B    K
B    M

Expected output for above training input:

ID   Col_M  Col_K
A    3      0      # A has 3 M in Col and 0 K in Col
B    1      1  

Above is for processing training data. For testing dataset, if requires to mapping over Col_M, Col_K, meaning, if other value like S appearing in Col, it will be ignored.

Testing Input:

ID   Col 
C    M
C    S

Expected output for above testing input:

ID   Col_M  Col_K
C    1      0      # A has 1 M in Col and 0 K in Col. S value is ignored
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
HappyCoding
  • 5,029
  • 7
  • 31
  • 51

3 Answers3

1

A possible data.table implementation could be first filter by c("M", "K"), then add these level (in case they aren't present like in your second case), then running dcast while specifying drop = FALSE, fill = 0L (for the cases when one of the desired levels is missing) while specifying fun = length (in order to count).

Testing on both data sets

library(data.table)

### First example
df <- fread("ID   Col
A    M
A    M
A    M
B    K
B    M")

dcast(df[Col %in% c("M", "K")], # Work only with c("M", "K")
      ID ~ factor(Col, levels = union(unique(Col), c("M", "K"))), # Add missing levels
      drop = FALSE, # Keep missing levels in output
      fill = 0L, # Fill missing values with zeroes instead of NAs
      fun = length) # Count. you can also specify 'value.var'

#    ID M K
# 1:  A 3 0
# 2:  B 1 1

### Second example
df <- fread("ID   Col 
C    M
C    S")

dcast(df[Col %in% c("M", "K")], 
  ID ~ factor(Col, levels = union(unique(Col), c("M", "K"))), 
  drop = FALSE,
  fill = 0L,
  fun = length)

#    ID M K
# 1:  C 1 0
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Thanks for your answer and detailed explanation. The solution is general enough. Are you able to attend above two extensions? – HappyCoding Jun 21 '17 at 15:13
  • You are transforming this into a [chameleon question](https://meta.stackoverflow.com/questions/266767/what-is-the-the-best-way-to-ask-follow-up-questions). You can't add new edits after getting an answer. Adding a window function a whole new question. – David Arenburg Jun 21 '17 at 16:51
  • thanks for the suggestion. please find the new question at https://stackoverflow.com/questions/44688711/how-to-get-quick-summary-in-data-table-with-a-look-back-window – HappyCoding Jun 22 '17 at 01:36
0

I am not sure how big is your data and how flexible the expected code should be, but I have this:

zz = '
ID   Col
A    M
A    M
A    M
B    K
B    M
'

df <- read.table(text = zz, header = TRUE)
col = as.data.frame(table(df))

out <- reshape(col, idvar = "ID",
               timevar = "Col", direction = "wide")
out

which gives you:

> out
  ID Freq.K Freq.M
1  A      0      3
2  B      1      1

And for the second data frame:

yy = '
ID   Col
C    M
C    S
'

df1 <- read.table(text = yy, header = TRUE)
col1 = as.data.frame(table(df1))

out1 <- reshape(col1, idvar = "ID",
               timevar = "Col", direction = "wide")
out1

you get:

> out1
  ID Freq.M Freq.S
1  C      1      1

Then just merge them together and delete the redundant:

ss = merge(out1, out, all.y = T, all.x = T)
ss
  ID Freq.M Freq.S Freq.K
1  C      1      1     NA
2  A      3     NA      0
3  B      1     NA      1
AK88
  • 2,946
  • 2
  • 12
  • 31
0
> library(data.table)
> dt=NULL
> dt$ID=c("A","A","A","B","B")
> dt$Col=c("M","M","M","K","M")
> dt=data.frame(dt)
> dt=data.table(dt)
> dt
   ID Col
1:  A   M
2:  A   M
3:  A   M
4:  B   K
5:  B   M


> a=dt[Col=="M",sum(.N),ID]
> b=dt[Col=="K",sum(.N),ID]
> a
   ID V1
1:  A  3
2:  B  1
> b
   ID V1
1:  B  1
> setkey(a,ID)
> setkey(b,ID)

> m=b[a]
> m
   ID V1 i.V1
1:  A NA    3
2:  B  1    1
> names(m)=c("ID","Col_K","Col_M")
> m
   ID Col_K Col_M
1:  A    NA     3
2:  B     1     1
Ajay Ohri
  • 3,382
  • 3
  • 30
  • 60