Create a column of unique identifiers based on another column in data.table

Question

As the title states, I'm trying to create a column in a data.table which would act as a unique identifier of another column. My dataset has a few hundred million observations, but here's a toy example and the code I've worked up so far:

library(data.table)

# I use a key because there are many more columns, but they are irrelevant here
myDT <- data.table(Addy = c("12hig", "12hig", "12hig", "1AbHN", "198aM"), key = "Addy")

    Addy
1: 12hig
2: 12hig
3: 12hig
4: 198aM
5: 1AbHN

uniqueDT <- unique(myDT[,list(Addy)]) # is this inefficient?
uniqueDT[,mrpId := seq(1,nrow(uniqueDT),1)]

    Addy mrpId
1: 12hig     1
2: 198aM     2
3: 1AbHN     3


myDT[J(uniqueDT)]
    Addy mrpId
1: 12hig     1
2: 12hig     1
3: 12hig     1
4: 198aM     2
5: 1AbHN     3

My code above gets the job done, but I don't really know if it's efficient. Is there a more data.table-esque way of doing this?

Edit:

You might be wondering why I'm creating unique identifiers from unique identifiers. Well, the idea here is to basically create a hash. The 'Addy' column data are very long strings, and I need to do operations on this data, so I think it is better to operate on a smaller number of bytes.

As to your larger question, if you set `Addy` as your key (which you likely should), I'm a bit skeptical that you'll get much if any speedup by using an alternative column containing the very same grouping information. My strong guess (but it is only a guess) is that behind the scenes -- whether they contain very short or very long strings -- any two keyed columns use the same machinery to id, subset, and operate on subgroups of the data.table. — Josh O'Brien, Mar 27 '15 at 19:22
Interesting, I'll keep that in mind. As of right now though, after a few operations in R the data is getting exported to other programs that aren't as memory efficient as `data.table`. — mrp, Mar 27 '15 at 20:16
@frank Yea, Matt Dowle's final edit covers this. However I searched for this question and didn't find that question, or this one: http://stackoverflow.com/questions/28910376/is-there-a-way-in-data-table-to-assign-ids-by-group-based-upon-an-identifier?lq=1 — mrp, Mar 28 '15 at 04:22
@mrp I don't mean the dupe vote as a criticism; I upvoted your question, too. Good find on that other question. — Frank, Mar 28 '15 at 17:43

score 4 · Accepted Answer · answered Mar 27 '15 at 19:14

This should be fast, and is at least a bit more straightforward:

myDT[, mrpID:=.GRP, by=Addy]
myDT
    Addy mrpID
1: 12hig     1
2: 12hig     1
3: 12hig     1
4: 198aM     2
5: 1AbHN     3
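
A self-contained version of the above, as a minimal sketch. `.GRP` is data.table's group counter, so each distinct `Addy` gets the next integer. The `first_seen` table and its `id` column are made up for illustration: they show that with `by=` the ids follow the order in which groups first appear, which for the keyed `myDT` happens to coincide with sorted order.

```r
library(data.table)

# Same toy data as in the question; key = "Addy" sorts the rows by Addy.
myDT <- data.table(Addy = c("12hig", "12hig", "12hig", "1AbHN", "198aM"),
                   key = "Addy")

# .GRP is the running group number; := adds the column by reference.
myDT[, mrpID := .GRP, by = Addy]
myDT$mrpID        # 1 1 1 2 3

# Hypothetical unkeyed table: by= numbers groups in order of first appearance.
first_seen <- data.table(Addy = c("1AbHN", "12hig", "1AbHN", "198aM"))
first_seen[, id := .GRP, by = Addy]
first_seen$id     # 1 2 1 3
```

If the ids need to follow sorted order on an unkeyed table, setting a key first (as the question already does) is one way to get that.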

Josh O'Brien


score 0 · Answer 2 · answered Mar 27 '15 at 19:21

Aside from data.table, the base factor class seems to be what you need:

myDT[, mrpID:=as.numeric(as.factor(Addy))]

nicola


I tried going this route, but running `as.numeric(as.factor())` turned out to be a fairly slow operation. – mrp Mar 27 '15 at 19:22
Also, when doing something like this, you need to take special measures to ensure that values returned are in ascending order. (Try this to see what I mean: `as.numeric(as.factor(c("C","B","A")))`.) You can get around that particular problem with `x <- c("C", "B", "A"); as.numeric(factor(x, levels=unique(x)))` – Josh O'Brien Mar 27 '15 at 19:26
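
To make the ordering point concrete, here is a small base R sketch. The vector `x` comes from the comment above. `match(x, unique(x))` is an alternative that nobody in the thread suggested; it should give the same first-appearance ids without building a factor. No claim is made here about how any of these perform on hundreds of millions of rows.

```r
x <- c("C", "B", "A")

# Default factor levels are sorted, so the ids follow alphabetical order.
ids_sorted <- as.integer(as.factor(x))                  # 3 2 1

# Fixing the levels to order of first appearance gives ascending ids.
ids_first <- as.integer(factor(x, levels = unique(x)))  # 1 2 3

# Alternative (an assumption, not from the thread): the position of each
# value among the unique values, i.e. the same first-appearance numbering.
ids_match <- match(x, unique(x))                        # 1 2 3
```

Using `as.integer()` rather than `as.numeric()` keeps the ids as 4-byte integers instead of 8-byte doubles, which fits the question's goal of a compact identifier.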
