As the title states, I'm trying to create a column in a data.table that acts as a unique identifier for the values of another column. My dataset has a few hundred million observations, but here's a toy example and the code I've worked up so far:
library(data.table)
# I use a key because there are many more columns, but they are irrelevant here
myDT <- data.table(Addy = c("12hig", "12hig", "12hig", "1AbHN", "198aM"), key = "Addy")
    Addy
1: 12hig
2: 12hig
3: 12hig
4: 198aM
5: 1AbHN
uniqueDT <- unique(myDT[,list(Addy)]) # is this inefficient?
uniqueDT[,mrpId := seq(1,nrow(uniqueDT),1)]
    Addy mrpId
1: 12hig     1
2: 198aM     2
3: 1AbHN     3
myDT[J(uniqueDT)]
    Addy mrpId
1: 12hig     1
2: 12hig     1
3: 12hig     1
4: 198aM     2
5: 1AbHN     3
My code above gets the job done, but I don't really know if it's efficient. Is there a more data.table-esque way of doing this?
Edit:
You might be wondering why I'm creating unique identifiers from unique identifiers. The idea is basically to create a hash. The values in the 'Addy' column are very long strings, and I need to do operations on this data, so I think it would be better to operate on a smaller number of bytes.
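To make the goal concrete, here is a minimal sketch of one possible way to get the same integer ids without building a separate lookup table. It assumes data.table's `.GRP` symbol (the group counter available in `j` when `by` is used) is available in the installed version. I have not benchmarked it on the full dataset, so treat it as an illustration rather than a verified answer:

```r
library(data.table)

# Same toy data as above, keyed on Addy
myDT <- data.table(Addy = c("12hig", "12hig", "12hig", "1AbHN", "198aM"), key = "Addy")

# .GRP is the running group number within a by= operation, so every
# distinct Addy value gets one integer id; := adds the column by reference
myDT[, mrpId := .GRP, by = Addy]

print(myDT)
```

The ids are assigned in order of first appearance of each group. Because the table is keyed on Addy, that order matches the sorted key order here, which gives the same mapping as the unique-then-join approach.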