R - Group by variable and then assign a unique ID

Question

I am interested in de-identifying a sensitive data set with both time-fixed and time-variant values. I want to (a) group all cases by social security number, (b) assign those cases a unique ID and then (c) remove the social security number.

Here's an example data set:

personal_id    gender  temperature
111-11-1111      M        99.6
999-999-999      F        98.2
111-11-1111      M        97.8
999-999-999      F        98.3
888-88-8888      F        99.0
111-11-1111      M        98.9

Any solutions would be very much appreciated.

Maybe a lazy solution, but I suppose you could just hash the social security numbers. — Chrisss, Sep 22 '16 at 23:57
One method would be `set.seed(1234); levels(personal_id) <- sample(length(levels(personal_id)))` Here, the seed would provide a "decryption" key, so you'd want to either hide that or not save it. — lmo, Sep 23 '16 at 00:01
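
A minimal base R sketch of the random-relabelling idea from the comment above, assuming the example data is held in a data frame called `df` (a name the question does not give). The `lookup` table is a hypothetical helper. No seed is set here, so the mapping cannot be regenerated afterwards.

```r
# Example data from the question (the name `df` is an assumption).
df <- data.frame(
  personal_id = c("111-11-1111", "999-999-999", "111-11-1111",
                  "999-999-999", "888-88-8888", "111-11-1111"),
  gender      = c("M", "F", "M", "F", "F", "M"),
  temperature = c(99.6, 98.2, 97.8, 98.3, 99.0, 98.9),
  stringsAsFactors = FALSE
)

# One randomly ordered integer per distinct personal_id.
ids    <- unique(df$personal_id)
lookup <- setNames(sample(length(ids)), ids)

# Map each row to its random ID, then drop the sensitive column.
df$group_id    <- unname(lookup[df$personal_id])
df$personal_id <- NULL
```

Rows that shared a personal_id still share a group_id, but the numbering no longer follows the sort order of the original values.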

score 76 · Answer 1 · answered Jun 25 '20 at 13:40

dplyr::group_indices() is deprecated as of dplyr 1.0.0. dplyr::cur_group_id() should be used instead:

library(dplyr)

df %>%
  group_by(personal_id) %>%
  mutate(group_id = cur_group_id())

  personal_id gender temperature group_id
  <chr>       <chr>        <dbl>    <int>
1 111-11-1111 M             99.6        1
2 999-999-999 F             98.2        3
3 111-11-1111 M             97.8        1
4 999-999-999 F             98.3        3
5 888-88-8888 F             99          2
6 111-11-1111 M             98.9        1
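
To also cover step (c) of the question, the same pipeline can drop the grouping and the sensitive column afterwards. This is a sketch, assuming dplyr 1.0.0 or later and the question's data in `df`. The name `deidentified` is chosen here for illustration.

```r
library(dplyr)

# Example data from the question (the name `df` is an assumption).
df <- data.frame(
  personal_id = c("111-11-1111", "999-999-999", "111-11-1111",
                  "999-999-999", "888-88-8888", "111-11-1111"),
  gender      = c("M", "F", "M", "F", "F", "M"),
  temperature = c(99.6, 98.2, 97.8, 98.3, 99.0, 98.9),
  stringsAsFactors = FALSE
)

deidentified <- df %>%
  group_by(personal_id) %>%            # one group per person
  mutate(group_id = cur_group_id()) %>% # integer ID of the current group
  ungroup() %>%                         # drop grouping before removing the key
  select(-personal_id)                  # remove the sensitive column

deidentified
```

Calling `ungroup()` first matters because `select()` keeps grouping columns, so `personal_id` would otherwise be retained.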

This should be the new accepted answer! – Avery Robbins Feb 18 '21 at 18:45

score 58 · Accepted Answer · edited May 22 '18 at 15:52

dplyr has a group_indices function for creating unique group IDs

library(dplyr)
data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
                       gender = c("M", "F", "M", "M"),
                       temperature = c(99.6, 98.2, 97.8, 95.5))

data$group_id <- data %>% group_indices(personal_id) 
data <- data %>% select(-personal_id)

data
  gender temperature group_id
1      M        99.6        1
2      F        98.2        3
3      M        97.8        2
4      M        95.5        1

Or within the same pipeline (https://github.com/tidyverse/dplyr/issues/2160):

data %>% 
    mutate(group_id = group_indices(., personal_id))

answered Sep 28 '16 at 19:41 by conor, edited May 22 '18 at 15:52 by Tyler Rinker

Unfortunately, `group_indices()` appears to automatically sort personal_id before creating the group_id, which is not always desired. – user2363777 Sep 06 '19 at 19:24

`group_indices()` was deprecated in dplyr 1.0.0. Please use `cur_group_id()` instead now. – sequoia May 30 '21 at 11:36
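
If the sorted numbering mentioned in the comment above is unwanted, one base R alternative is to number the groups in order of first appearance. This is a sketch on the answer's example data, not something proposed in the thread.

```r
data <- data.frame(
  personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
  gender      = c("M", "F", "M", "M"),
  temperature = c(99.6, 98.2, 97.8, 95.5),
  stringsAsFactors = FALSE
)

# unique() keeps values in order of first appearance, and match() returns
# the position of each row's value within that vector.
data$group_id    <- match(data$personal_id, unique(data$personal_id))
data$personal_id <- NULL

data$group_id
# 1 2 3 1
```

Here "999-999-999" gets ID 2 because it is the second distinct value encountered, whereas the sorted approach gives it 3.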

score 2 · Answer 3 · answered Sep 26 '16 at 09:59

Using dplyr package :

library(dplyr)
data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
                 gender = c("M", "F", "M", "M"),
                 temperature = c(99.6, 98.2, 97.8, 95.5))

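First, extract the distinct personal_id values to build a lookup table. Wrapping the column in factor() keeps this working when personal_id is stored as character, which is the data.frame() default since R 4.0.0:

cases <- data.frame(levels = levels(factor(data$personal_id)))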

using rownames, you get a unique identifier :

cases <- cases %>%
    mutate(id = rownames(cases))

results :

       levels id
1 111-111-111  1
2 222-222-222  2
3 999-999-999  3

then you join the cases dataframe with your data :

data <- left_join(data, cases, by = c("personal_id" = "levels"))

Optionally, combine the generated id with the gender into a single identifier (this does not make the id any more unique, since id already distinguishes each person):

mutate(UID = paste(id, gender, sep=""))

and finally remove the personal_id and the simple id :

select(-personal_id, -id)

Putting it all together (run this in place of the separate join, mutate and select steps above, not after them):

data <- left_join(data, cases, by = c("personal_id" = "levels")) %>%
        mutate(UID = paste(id, gender, sep="")) %>%
        select(-personal_id, -id)

results :

  gender temperature UID
1      M        99.6  1M
2      F        98.2  3F
3      M        97.8  2M
4      M        95.5  1M
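
The lookup-and-join steps above amount to numbering the sorted distinct values. Base R can do the same by converting the column to a factor and taking its integer codes. A short sketch on the same example data, not part of the original answer:

```r
data <- data.frame(
  personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
  gender      = c("M", "F", "M", "M"),
  temperature = c(99.6, 98.2, 97.8, 95.5),
  stringsAsFactors = FALSE
)

# factor() sorts the distinct values into levels; as.integer() returns
# each row's level index, which serves as the group ID.
data$group_id    <- as.integer(factor(data$personal_id))
data$personal_id <- NULL

data$group_id
# 1 3 2 1
```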
