0

I've got a large data.table (200M rows x 300 columns), DT, with multiple (over 50) identifier columns. The identifiers are all in different format and some of them are fairly complex and long, and I would like to convert all of them (selected_cols) to simple numerical identifiers.

I can use .GRP for one column at a time, and it's super fast (well, relatively speaking, in context!)

DT[, new_col_1 := .GRP , by = .(col_1)]  #this works for one column at a time

Is there a way to do this for multiple columns using the .GRP business? I know how to do it if I define my own function, using lapply, but I can't use .GRP in a function. Might be wishful thinking. I can also do it with a for-loop, but I hate for-loops, they give me the creeps as they don't scale up. Just hoping to avoid creating my own function or using for-loops for speed reasons. it's a simple operation but takes a long time for a large data.table.

DT[ , (paste0('new_', selected_cols)) := lapply(.SD, some_function_with_.GRP), .SDcols = selected_cols)] 

here's a data.table sample, if you need one:

require(data.table)
DT = data.table(col1 = c('A','B','B','D','B','A','A','B','R','T','E','E','H','T','Y','F','F','F')
               ,col2 = c('DD','GG','RR','HH','SS','AA','CC','RR','EE','DD','HH','BB','CC','AA','QQ','EE','YY','MM')
               , col3 = c('FFF1', 'HHH1', 'CCC1', 'AAA1', 'FFF1', 'RRR1', 'GGG1', 'DDD1', 'FFF1', 'JJJ1', 'VVV1', 'CCC1', 'AAA1', 'XXX1', 'GGG1', 'HHH1', 'AAA1', 'RRR1'))

And this is the output I'm after:

> DT
    col1 col2 col3 new_col1 new_col2 new_col3
 1:    A   DD FFF1        1        1        1
 2:    B   GG HHH1        2        2        2
 3:    B   RR CCC1        2        3        3
 4:    D   HH AAA1        3        4        4
 5:    B   SS FFF1        2        5        1
 6:    A   AA RRR1        1        6        5
 7:    A   CC GGG1        1        7        6
 8:    B   RR DDD1        2        3        7
 9:    R   EE FFF1        4        8        1
10:    T   DD JJJ1        5        1        8
11:    E   HH VVV1        6        4        9
12:    E   BB CCC1        6        9        3
13:    H   CC AAA1        7        7        4
14:    T   AA XXX1        5        6       10
15:    Y   QQ GGG1        8       10        6
16:    F   EE HHH1        9        8        2
17:    F   YY AAA1        9       11        4
18:    F   MM RRR1        9       12        5

I'm looking for native data.table solution.

Ankhnesmerira
  • 1,386
  • 15
  • 29

1 Answers1

1

One way would be using match and unique :

library(data.table)
cols <- paste0('col', 1:3)

DT[, paste0('new_', cols) := lapply(.SD, function(x) 
            match(x, unique(x))), .SDcols = cols]
DT

#    col1 col2 col3 new_col1 new_col2 new_col3
# 1:    A   DD FFF1        1        1        1
# 2:    B   GG HHH1        2        2        2
# 3:    B   RR CCC1        2        3        3
# 4:    D   HH AAA1        3        4        4
# 5:    B   SS FFF1        2        5        1
# 6:    A   AA RRR1        1        6        5
# 7:    A   CC GGG1        1        7        6
# 8:    B   RR DDD1        2        3        7
# 9:    R   EE FFF1        4        8        1
#10:    T   DD JJJ1        5        1        8
#11:    E   HH VVV1        6        4        9
#12:    E   BB CCC1        6        9        3
#13:    H   CC AAA1        7        7        4
#14:    T   AA XXX1        5        6       10
#15:    Y   QQ GGG1        8       10        6
#16:    F   EE HHH1        9        8        2
#17:    F   YY AAA1        9       11        4
#18:    F   MM RRR1        9       12        5
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • so, i take it there isn't really a way around creating a function, eh? This is still very helpful as your solution function is faster than mine. Many thanks! I'll mark it as solution if no further improvement in performance is suggested. – Ankhnesmerira Aug 26 '20 at 07:43
  • was faster than rleid on my large dataset. – Ankhnesmerira Aug 26 '20 at 09:04