1

If needed I can put together a dataset, but my question is somewhat general.

accts <- accts[, .SD[which.max(EE)], by=DnB.Name]

I've got a DT of about 350k rows, and some of the DnB.Name's (Duns and Bradstreet Company Name) are duplicates with differing employee counts (EE), I only care about the highest number of each and can disregard the rest.

Anyway, DT is usually lightning quick, so I figure I must be doing something wrong?

wizard_draziw
  • 505
  • 1
  • 5
  • 17
  • Probably to do with using `.SD`, which means you are having to load a whole chunk of your `data.table` for each `by` group. See here: http://stackoverflow.com/questions/12707368/finding-the-index-of-a-max-value-in-r – thelatemail May 02 '14 at 02:21
  • 2
    The fastest way currently would be to use `.I` instead as shown [here](http://stackoverflow.com/a/16574176/559784). In your case, just replace the expression inside to `.I[which.max(..)]`. – Arun May 02 '14 at 08:28

1 Answers1

2

sort by EE, then take the first row for each group using a self join:

 ordered<-accts[order(-EE)] #Descending order
 setkey(ordered,DnB.Name) #must setkey before join
 ordered[J(unique(DnB.Name)),mult="first"]

For reference, check out this post on crossvalidated: https://stats.stackexchange.com/questions/7884/fast-ways-in-r-to-get-the-first-row-of-a-data-frame-grouped-by-an-identifier

EDIT: even faster, but weird syntax:

accts[accts[, .I[which.max(EE)], by = DnB.Name]$V1]

For reference, check this post with a similar question: Subset by group with data.table

Community
  • 1
  • 1
Ben Rollert
  • 1,564
  • 1
  • 13
  • 21
  • clever, still really slow though. But I think this is the right direction. How would I keep the first instance of each DnB.Name but remove duplicates without the .SD, because I think what's going on is that since duplicates are rare, it's subsetting 340k times – wizard_draziw May 02 '14 at 02:46
  • try the self join method – Ben Rollert May 02 '14 at 02:55
  • what is the ID variable? – wizard_draziw May 02 '14 at 03:00
  • sorry, was thinking generally, updated with DnB.name. This should be lightning fast. – Ben Rollert May 02 '14 at 03:03
  • I see, gotta keep the first line, skip the second, then proceed with your edit but ID should be DnB.Name. Brilliant, super crazy quick. Make those last edits and I'll give you the check mark :) – wizard_draziw May 02 '14 at 03:05
  • Updated my answer. Although the solution using `.I` is a bit confusing to the average data.table user. I'm sure you guys have plans to implement that under the hood so the user is abstracted from it. I'm certainly not complaining ;) – Ben Rollert May 02 '14 at 09:56
  • Yo dawg, I hear you like data.tables, so I put a data.table in a data.table..... errr couldn't resist. Could you explain what's going on in your exotic second solution? I didn't realize you could nest data.tables in data.tables, aside from joins... then again, maybe I just answered my own question? – wizard_draziw May 02 '14 at 22:24