I have a large `data.table` with >300k rows and 884 columns (see here for the data in CSV, 700 MB). I am trying to get labels for identical rows, which is exactly what `.GRP` does wonderfully in `data.table`. Unfortunately, it takes forever to run and in most cases crashes the R session. Any ideas on how to split up the problem or speed up the solution would be greatly appreciated.
Here is a MWE with the data mentioned above:
troutmat <- fread("troutmat.csv")
troutmat[, grp := .GRP, by = names(troutmat)]
This crashes the R session (R 4.1.1 with data.table 1.14.2 on a 16-core Windows server).
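For clarity, a minimal sketch of the labels I am after, on a hypothetical toy table rather than the real data:

```r
library(data.table)

# Toy table: rows 1 and 3 are identical, so they should share a label
dt <- data.table(a = c(1, 2, 1, 2), b = c("x", "y", "x", "z"))
dt[, grp := .GRP, by = names(dt)]
dt$grp
# [1] 1 2 1 3
```

This is exactly what the one-liner above does on the real table; the only problem is scale.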
Happy to open a bug report, if needed.
EDIT: Essentially, it is the same question as data.table "key indices" or "group counter", but with a much larger dataset. I am trying to find a fast way to detect duplicated rows, in a way that tells me which row is a duplicate of which other row.
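One direction I have been considering for splitting up the problem (a sketch only, not verified on the real 884-column table; `grp_blockwise` and `block_size` are names I made up for illustration): compute a dense rank per block of columns with `frankv`, then compute the final label over the few per-block ids, since identical rows must agree in every block.

```r
library(data.table)

# Hypothetical divide-and-conquer sketch: rank each block of columns
# separately, then group on the much smaller set of block ids.
grp_blockwise <- function(dt, block_size = 100L) {
  cols   <- names(dt)
  blocks <- split(cols, ceiling(seq_along(cols) / block_size))
  # one dense rank per column block; far fewer columns per call
  ids <- as.data.table(lapply(blocks, function(b)
    frankv(dt, cols = b, ties.method = "dense")))
  # final label over the handful of block-id columns
  frankv(ids, ties.method = "dense")
}

small <- data.table(a = c(1, 2, 1), b = c(3, 4, 3),
                    c = c(5, 6, 5), d = c(7, 8, 7))
grp_blockwise(small, block_size = 2L)  # rows 1 and 3 get the same label
```

Note that, unlike `.GRP`, dense ranks follow sort order rather than order of first appearance, so the numbering of the labels may differ, but duplicate rows still share a label. I have no idea whether this avoids the crash, or whether the crash itself is a bug worth reporting.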