how to get quick summary of count in data.table

Question

This is a part of feature engineering that summarizes each ID depending on column called Col. The same preprocess will be applied to the testing set. Since the data set is large, data.table based solution may be more preferred.

Training Input:

ID   Col
A    M
A    M
A    M
B    K
B    M

Expected output for above training input:

ID   Col_M  Col_K
A    3      0      # A has 3 M in Col and 0 K in Col
B    1      1

Above is for processing training data. For testing dataset, if requires to mapping over Col_M, Col_K, meaning, if other value like S appearing in Col, it will be ignored.

Testing Input:

ID   Col 
C    M
C    S

Expected output for above testing input:

ID   Col_M  Col_K
C    1      0      # A has 1 M in Col and 0 K in Col. S value is ignored

can you add some paragraphs to explain what is shown? dummy code can help creating good answers :) — pachadotdev, Jun 21 '17 at 04:32
If you actually have R installed, and the data.table package as well, there's this: https://stackoverflow.com/q/18881073/ Just define your own `inds` values. — Frank, Jun 21 '17 at 04:38

David Arenburg · Accepted Answer · 2017-06-21T07:18:40.107

1

A possible data.table implementation could be first filter by c("M", "K"), then add these level (in case they aren't present like in your second case), then running dcast while specifying drop = FALSE, fill = 0L (for the cases when one of the desired levels is missing) while specifying fun = length (in order to count).

Testing on both data sets

library(data.table)

### First example
df <- fread("ID   Col
A    M
A    M
A    M
B    K
B    M")

dcast(df[Col %in% c("M", "K")], # Work only with c("M", "K")
      ID ~ factor(Col, levels = union(unique(Col), c("M", "K"))), # Add missing levels
      drop = FALSE, # Keep missing levels in output
      fill = 0L, # Fill missing values with zeroes instead of NAs
      fun = length) # Count. you can also specify 'value.var'

#    ID M K
# 1:  A 3 0
# 2:  B 1 1

### Second example
df <- fread("ID   Col 
C    M
C    S")

dcast(df[Col %in% c("M", "K")], 
  ID ~ factor(Col, levels = union(unique(Col), c("M", "K"))), 
  drop = FALSE,
  fill = 0L,
  fun = length)

#    ID M K
# 1:  C 1 0

edited Jun 21 '17 at 07:18

answered Jun 21 '17 at 06:40

David Arenburg

91,361
17
137
196

Thanks for your answer and detailed explanation. The solution is general enough. Are you able to attend above two extensions? – HappyCoding Jun 21 '17 at 15:13
You are transforming this into a [chameleon question](https://meta.stackoverflow.com/questions/266767/what-is-the-the-best-way-to-ask-follow-up-questions). You can't add new edits after getting an answer. Adding a window function a whole new question. – David Arenburg Jun 21 '17 at 16:51
thanks for the suggestion. please find the new question at https://stackoverflow.com/questions/44688711/how-to-get-quick-summary-in-data-table-with-a-look-back-window – HappyCoding Jun 22 '17 at 01:36

score 0 · Answer 2 · answered Jun 21 '17 at 04:50

I am not sure how big is your data and how flexible the expected code should be, but I have this:

zz = '
ID   Col
A    M
A    M
A    M
B    K
B    M
'

df <- read.table(text = zz, header = TRUE)
col = as.data.frame(table(df))

out <- reshape(col, idvar = "ID",
               timevar = "Col", direction = "wide")
out

which gives you:

> out
  ID Freq.K Freq.M
1  A      0      3
2  B      1      1

And for the second data frame:

yy = '
ID   Col
C    M
C    S
'

df1 <- read.table(text = yy, header = TRUE)
col1 = as.data.frame(table(df1))

out1 <- reshape(col1, idvar = "ID",
               timevar = "Col", direction = "wide")
out1

you get:

> out1
  ID Freq.M Freq.S
1  C      1      1

Then just merge them together and delete the redundant:

ss = merge(out1, out, all.y = T, all.x = T)
ss
  ID Freq.M Freq.S Freq.K
1  C      1      1     NA
2  A      3     NA      0
3  B      1     NA      1

That's right. I was waiting for a comment if this was OK or not, Then I'd edit the last bit of the code to get rid off `S` col and sub `NAs` to `0`. — AK88, Jun 21 '17 at 08:01

score 0 · Answer 3 · answered Jun 21 '17 at 06:28

0

> library(data.table)
> dt=NULL
> dt$ID=c("A","A","A","B","B")
> dt$Col=c("M","M","M","K","M")
> dt=data.frame(dt)
> dt=data.table(dt)
> dt
   ID Col
1:  A   M
2:  A   M
3:  A   M
4:  B   K
5:  B   M


> a=dt[Col=="M",sum(.N),ID]
> b=dt[Col=="K",sum(.N),ID]
> a
   ID V1
1:  A  3
2:  B  1
> b
   ID V1
1:  B  1
> setkey(a,ID)
> setkey(b,ID)

> m=b[a]
> m
   ID V1 i.V1
1:  A NA    3
2:  B  1    1
> names(m)=c("ID","Col_K","Col_M")
> m
   ID Col_K Col_M
1:  A    NA     3
2:  B     1     1

answered Jun 21 '17 at 06:28

Ajay Ohri

3,382
3
30
60

used data.table code for selection and merging due to big data concerns – Ajay Ohri Jun 21 '17 at 06:29
it should since selection is only for M and K, prior to merge – Ajay Ohri Jun 21 '17 at 07:56
There is no K in the second example but you still need to have it in the output – David Arenburg Jun 21 '17 at 08:40
Col_K would be NA – Ajay Ohri Jun 21 '17 at 09:54
@AjayOhri, thanks for the elegant solution. Can you add one more step if percentage between Col_K, Col_M is required? Thanks! – HappyCoding Jun 21 '17 at 14:51

how to get quick summary of count in data.table

3 Answers3

Linked