
I have a large data.table with >300k rows and 884 columns (see here for the data as a CSV, 700 MB).

I am trying to get labels for identical rows, which is exactly what `.GRP` does wonderfully in data.table. Unfortunately, it takes forever to run and in most cases crashes the R session. Any ideas on how to split up the problem or speed up the solution would be greatly appreciated.

Here is an MWE with the data mentioned above:

  troutmat <- fread("troutmat.csv")
  # one integer label per distinct combination of all 884 columns
  troutmat[, grp := .GRP, by = names(troutmat)]

-> crashes the R session (R 4.1.1 with data.table 1.14.2 on a 16-core Windows server)
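One direction I am considering for splitting up the problem is to group in column blocks first, so the final grouping only sees a handful of integer columns instead of all 884 at once. A rough sketch (untested on the full data; the block size of 100 is arbitrary):

  library(data.table)
  troutmat <- fread("troutmat.csv")

  # split the 884 column names into blocks of roughly 100
  blocks <- split(names(troutmat), ceiling(seq_along(troutmat) / 100))

  # one integer label per block of columns, assigned by reference
  blk_cols <- paste0(".blk", seq_along(blocks))
  for (i in seq_along(blocks)) {
    troutmat[, (blk_cols[i]) := .GRP, by = blocks[[i]]]
  }

  # final grouping over a few integer columns instead of 884 mixed ones
  troutmat[, grp := .GRP, by = blk_cols]
  troutmat[, (blk_cols) := NULL]

I have no idea yet whether this actually avoids the crash, so better suggestions are very welcome.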

Happy to open a bug report, if needed.

EDIT: Essentially, it's the same question as data.table "key indices" or "group counter", but with a much larger dataset. I am trying to find a fast way to find duplicated rows, but in a way that I know which row is a duplicate of which other row.
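To make the goal concrete, here is a toy example (made-up data) of the labelling I am after; with the `grp` label every row can be traced back to the first occurrence of its duplicate group:

  library(data.table)
  dt <- data.table(a = c(1, 2, 1, 2), b = c("x", "y", "x", "z"))
  dt[, grp := .GRP, by = names(dt)]     # rows 1 and 3 share a label
  dt[, first_row := .I[1L], by = grp]   # index of the first occurrence in each group
  dt
  #>    a b grp first_row
  #> 1: 1 x   1         1
  #> 2: 2 y   2         2
  #> 3: 1 x   1         1
  #> 4: 2 z   3         4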

Puki Luki
  • Please be careful with code you provide as a MWE: this won't run unless `troutman.csv` is oddly an object that contains the filename. – r2evans Mar 04 '22 at 12:27
  • sorry, forgot the "" in the fread() command. – Puki Luki Mar 04 '22 at 12:31
  • how many threads does `data.table` use ? can you check `getDTthreads()` – Samet Sökel Mar 04 '22 at 12:34
  • @SametSökel it uses all, so 16. – Puki Luki Mar 04 '22 at 12:36
  • @PukiLuki ok, if you are grouping for every variable, isn't it the same as adding row number? – Samet Sökel Mar 04 '22 at 12:37
  • Essentially, it's the same as the question in https://stackoverflow.com/questions/13018696/data-table-key-indices-or-group-counter, BUT with a much larger dataset. I added the reference in the question. So if I understand correctly, it's not adding row numbers, right? Thanks. – Puki Luki Mar 04 '22 at 12:41
  • A `data.table` key is intended to be used to sort the data internally (benefits binary search, faster joins, faster grouping, and other benefits). While setting *all columns* as keys is supported, does it make sense? – r2evans Mar 04 '22 at 12:53
  • I understand, thanks for the explanations. However, in this case I am trying to find a fast way to find duplicated rows, but in a way that I know which row is a duplicate of which other row. – Puki Luki Mar 04 '22 at 13:07
  • Grouping by all columns is the same as row number only if there are no duplicated rows. – jangorecki Mar 04 '22 at 16:27

0 Answers