0

I am looking to generate a unique ID based on identical values in either of two columns. Specifically, I have the phone numbers and e-mails of customers over a historic period. Therefore, I want to generate a unique ID identifying the customer, even if the customer has changed phone number or e-mail during this time period. The data looks like this:

E-mail         Phone      Name
mortena        3076       morten
kaspera        2688       kasper
christoffera   1212       christoffer
mortenb        3076       morten
mortena        3075       morten
kasperb        2688       kasper
christoffera   1213       christoffer

And I'd like to produce this result:

E-mail         Phone      Name        ID
mortena        3076       morten      1
kaspera        2688       kasper      2
christoffera   1212       christoffer 3
mortenb        3076       morten      1
mortena        3075       morten      1
kasperb        2688       kasper      2
christoffera   1213       christoffer 3

Any help is much appreciated!

I've attempted to utilize the code below. However, this seems to create ID's based on identical e-mail and phonenumbers. I am looking to generate a unique ID based on either e-mail or phonenumber.

test_data %>% 
  mutate(ID = group_indices_(test_data, .dots=c("E.mail", "Phone")))

I expect the script to check through phone numbers and generate a unique ID per unique phone number, and if it can't find any duplicate phone numbers then go through e-mail and do the same thing.

jogo
  • 12,469
  • 11
  • 37
  • 42
  • You could use `df %>% mutate(ID = as.integer(as.factor(Name)))` but this will give slightly different numbers (e.g. `"morten"` will give `ID = 3` and so on, based on the alphabetical ordering of unique `Name`s). The same can be achieved with `df %>% mutate(ID = group_indices(., Name))`. Is the particular order of `ID`s a critical requirement? – Maurits Evers Jun 09 '19 at 09:51
  • Another option is to use `match`, `match(test_data$Name, unique(test_data$Name))` – Ronak Shah Jun 09 '19 at 10:33

1 Answers1

1

Further to my comment above, to reproduce the same ID order you need to ensure the proper factor level ordering

library(dplyr)
df %>% mutate(ID = as.integer(factor(Name, levels = unique(Name))))
#        E.mail Phone        Name ID
#1      mortena  3076      morten  1
#2      kaspera  2688      kasper  2
#3 christoffera  1212 christoffer  3
#4      mortenb  3076      morten  1
#5      mortena  3075      morten  1
#6      kasperb  2688      kasper  2
#7 christoffera  1213 christoffer  3

or with group_indices

df %>% mutate(ID = group_indices(., factor(Name, levels = unique(Name))))

If the order of IDs doesn't matter you can do

df %>% mutate(ID = as.integer(as.factor(Name)))

or

df %>% mutate(ID = group_indices(., Name))

In these cases, the order of ID is determined by the alphabetical order of unique Names (rather than by their occurrence).


Sample data

df <- read.table(text = "E-mail         Phone      Name
mortena        3076       morten
kaspera        2688       kasper
christoffera   1212       christoffer
mortenb        3076       morten
mortena        3075       morten
kasperb        2688       kasper
christoffera   1213       christoffer", header = T)
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68