I've got a large data.table (200M rows x 300 columns), DT, with multiple (over 50) identifier columns. The identifiers are all in different format and some of them are fairly complex and long, and I would like to convert all of them (selected_cols) to simple numerical identifiers.
I can use .GRP for one column at a time, and it's super fast (well, relatively speaking, in context!)
DT[, new_col_1 := .GRP , by = .(col_1)] #this works for one column at a time
Is there a way to do this for multiple columns using the .GRP business? I know how to do it if I define my own function, using lapply, but I can't use .GRP in a function. Might be wishful thinking. I can also do it with a for-loop, but I hate for-loops, they give me the creeps as they don't scale up. Just hoping to avoid creating my own function or using for-loops for speed reasons. it's a simple operation but takes a long time for a large data.table.
DT[ , (paste0('new_', selected_cols)) := lapply(.SD, some_function_with_.GRP), .SDcols = selected_cols)]
here's a data.table sample, if you need one:
require(data.table)
DT = data.table(col1 = c('A','B','B','D','B','A','A','B','R','T','E','E','H','T','Y','F','F','F')
,col2 = c('DD','GG','RR','HH','SS','AA','CC','RR','EE','DD','HH','BB','CC','AA','QQ','EE','YY','MM')
, col3 = c('FFF1', 'HHH1', 'CCC1', 'AAA1', 'FFF1', 'RRR1', 'GGG1', 'DDD1', 'FFF1', 'JJJ1', 'VVV1', 'CCC1', 'AAA1', 'XXX1', 'GGG1', 'HHH1', 'AAA1', 'RRR1'))
And this is the output I'm after:
> DT
col1 col2 col3 new_col1 new_col2 new_col3
1: A DD FFF1 1 1 1
2: B GG HHH1 2 2 2
3: B RR CCC1 2 3 3
4: D HH AAA1 3 4 4
5: B SS FFF1 2 5 1
6: A AA RRR1 1 6 5
7: A CC GGG1 1 7 6
8: B RR DDD1 2 3 7
9: R EE FFF1 4 8 1
10: T DD JJJ1 5 1 8
11: E HH VVV1 6 4 9
12: E BB CCC1 6 9 3
13: H CC AAA1 7 7 4
14: T AA XXX1 5 6 10
15: Y QQ GGG1 8 10 6
16: F EE HHH1 9 8 2
17: F YY AAA1 9 11 4
18: F MM RRR1 9 12 5
I'm looking for native data.table solution.