1

I am trying to get the top frequency words in a data.table

data.table : dtable4G

key              freq     value
================================
thanks for the   612      support
thanks for the   380      drink
thanks for the   215      payment
thanks for the    27      encouragement
have a great     154      day
have a great     132      weekend
have a great      54      week  
have a great      42      time
have a great      19      night
at the same      346      time
at the same       57      damn
at the same       30      pace
at the same       11      speed
at the same        7      level
at the same        1      rate 

I tried the code

dtable4G[ , max(freq), by = key] 

and

dtable4G[ , .I[which.max(freq)] , by = key]

Both the above commands, I am getting the same result:

key              V1
====================
thanks for the   612
have a great     154
at the same      346

I want the result to be:

key              freq     value
================================
thanks for the   612      support
have a great     154      day
at the same      346      time

Any ideas what I am doing wrong?

EDITED

dtable4G[dtable4G[, .I[which.max(freq)], by = key]$V1]

worked for me. Though it took some time to run through my 5.4 mil rows.

But this was way faster than using

dtable4G[,.SD[which.max(freq)],by=key]

Reference: With data.table, is SD[which.max(Var1)] the fastest way to find the max of a group?

Community
  • 1
  • 1
PeterV
  • 195
  • 1
  • 13

1 Answers1

2

We can subset the data table for only the max freq of each key column value with the following:

dtable4G[,.SD[which.max(freq)],by=key]

For better performance you can use the below approach as well. It doesn't construct the .SD and is thus faster:

dtable4g[dtable4g[, .I[which.max(freq)], by = key]$V1]
mtoto
  • 23,919
  • 4
  • 58
  • 71