-2

I have a data.frame with 4 columns, "id1", "id2", "id3", and "name", where "id1", "id2", and "id3" are very long strings.

I need to combine "id1", "id2", and "id3" into a new key. The same combination of "id1" + "id2" + "id3" may occur in more than one row, so each distinct combination of values should map to exactly one key.

I want the new key to be simple and short, such as 'key1', 'key2', etc.

A5C1D2H2I1M1N2O1R2T1
linus
  • 1
    But your "newbie brain" can do it inefficiently and show what you have tried? Some input data and the expected result would also help. – agstudy Jul 10 '13 at 11:15
  • 1
    For those who downvote without any explanation: I don't think you are of great help to the OP. Actually, it is a good question, just not well formulated. – agstudy Jul 10 '13 at 11:17

2 Answers

1

Something like this?

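set.seed(42)
# Example data: three key columns with repeated combinations
DF <- data.frame(key1=sample(letters[1:2],10,TRUE),
                 key2=sample(letters[1:2],10,TRUE),
                 key3=sample(letters[1:2],10,TRUE))

DF <- within(DF, {
  # One factor level per combination that actually occurs (drop=TRUE)
  newkey <- interaction(key1, key2, key3, drop=TRUE)
  # Rename the levels to short labels: key1, key2, ...
  levels(newkey) <- paste0("key", seq_along(levels(newkey)))
})
DF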

#    key1 key2 key3 newkey
# 1     b    a    b   key4
# 2     b    b    a   key2
# 3     a    b    b   key5
# 4     b    a    b   key4
# 5     b    a    a   key1
# 6     b    b    b   key6
# 7     b    b    a   key2
# 8     a    a    b   key3
# 9     b    a    a   key1
# 10    b    b    b   key6
Roland
  • +1! Maybe I would use longer key1, key2, ... strings in the sample. – agstudy Jul 10 '13 at 11:46
  • When the data size is small, it works great. But when I tried a data frame with 500k rows, I got an overflow error, as R tried to allocate a vector of length 1340 million. – linus Jul 10 '13 at 15:29
  • How many unique key values are there for each id? Please give a reproducible example of what your data actually looks like. You can find out [here](http://stackoverflow.com/a/5963610/1412059) how to do that. – Roland Jul 10 '13 at 17:03
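
The allocation error reported in the comment above is consistent with interaction() working over the full cross product of the factor levels, which grows very quickly when each id has many distinct values. As a hedged base R sketch (not part of the original answer), the ids can instead be pasted into one string per row and numbered by first appearance. The example data and the "|" separator are assumptions for illustration; the separator must not occur inside the ids themselves.

```r
# Hypothetical example data; the short id values stand in for the long
# strings described in the question
DF <- data.frame(id1 = c("x", "x", "y", "x"),
                 id2 = c("p", "p", "q", "p"),
                 id3 = c("m", "n", "m", "m"),
                 stringsAsFactors = FALSE)

# Join the three ids into one string per row. The "|" separator is an
# assumption: pick a character that cannot appear in the ids.
combo <- paste(DF$id1, DF$id2, DF$id3, sep = "|")

# Number each distinct combination in order of first appearance and
# prefix it to get short keys
DF$newkey <- paste0("key", match(combo, unique(combo)))

DF$newkey
# [1] "key1" "key2" "key3" "key1"
```

Only the combinations that actually occur are ever materialised here, so the work scales with the number of rows rather than with the product of the level counts.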
1

I would recommend using .GRP from "data.table" for this. It also shouldn't struggle with 500k rows of data:

library(data.table)
as.data.table(DF)[, combined := .GRP, by = names(DF)][]
#     key1 key2 key3 combined
#  1:    b    a    b        1
#  2:    b    b    a        2
#  3:    a    b    b        3
#  4:    b    a    b        1
#  5:    b    a    a        4
#  6:    b    b    b        5
#  7:    b    b    a        2
#  8:    a    a    b        6
#  9:    b    a    a        4
# 10:    b    b    b        5

If you need the combined key to be sorted according to a sorted set of the other keys, use setkey before doing the above step.
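
A minimal sketch of that setkey variant, assuming a small made-up table (the data and the name DT are hypothetical, the column names follow the example above):

```r
library(data.table)

# Hypothetical example data with the same column names as above
DT <- data.table(key1 = c("b", "a", "b", "a"),
                 key2 = c("b", "a", "a", "a"),
                 key3 = c("a", "b", "b", "b"))

# Sort the table by all three key columns (rows are reordered by reference)
setkey(DT, key1, key2, key3)

# Groups are now encountered in sorted order, so .GRP follows that order
DT[, combined := .GRP, by = c("key1", "key2", "key3")][]

DT$combined
# [1] 1 1 2 3
```

Note that setkey changes the row order of the table itself, so keep a row id column first if the original order matters.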

A5C1D2H2I1M1N2O1R2T1