42

I am interested in de-identifying a sensitive data set with both time-fixed and time-variant values. I want to (a) group all cases by social security number, (b) assign those cases a unique ID and then (c) remove the social security number.

Here's an example data set:

personal_id    gender  temperature
111-11-1111      M        99.6
999-999-999      F        98.2
111-11-1111      M        97.8
999-999-999      F        98.3
888-88-8888      F        99.0
111-11-1111      M        98.9

Any solutions would be very much appreciated.

A. Suliman
B Victor
  • Maybe a lazy solution, but I suppose you could just hash the social security numbers. – Chrisss Sep 22 '16 at 23:57
  • One method would be `set.seed(1234); levels(personal_id) <- sample(length(levels(personal_id)))` Here, the seed would provide a "decryption" key, so you'd want to either hide that or not save it. – lmo Sep 23 '16 at 00:01
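A minimal base R sketch of the random-relabelling idea from the comment above, applied to the question's example data. The names `df`, `ids`, `random_key`, and `subject_id` are illustrative assumptions, not from the original post. No seed is set, so the mapping cannot be regenerated later from a saved seed.

```r
# Example data copied from the question (the object name `df` is an assumption).
df <- data.frame(
  personal_id = c("111-11-1111", "999-999-999", "111-11-1111",
                  "999-999-999", "888-88-8888", "111-11-1111"),
  gender      = c("M", "F", "M", "F", "F", "M"),
  temperature = c(99.6, 98.2, 97.8, 98.3, 99.0, 98.9),
  stringsAsFactors = FALSE
)

# Distinct identifiers, in order of first appearance.
ids <- unique(df$personal_id)

# A random permutation of 1..n, so the new ID does not follow the sort order
# of the original identifiers. Deliberately no set.seed() call here.
random_key <- sample(length(ids))

# Look up each row's identifier and assign the shuffled integer for it.
df$subject_id <- random_key[match(df$personal_id, ids)]

# Drop the sensitive column.
df$personal_id <- NULL
df
```

Rows that shared a `personal_id` still share a `subject_id`, but which integer each person receives changes from run to run.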

3 Answers

76

dplyr::group_indices() is deprecated as of dplyr 1.0.0. dplyr::cur_group_id() should be used instead:

library(dplyr)

df %>%
  group_by(personal_id) %>%
  mutate(group_id = cur_group_id())

  personal_id gender temperature group_id
  <chr>       <chr>        <dbl>    <int>
1 111-11-1111 M             99.6        1
2 999-999-999 F             98.2        3
3 111-11-1111 M             97.8        1
4 999-999-999 F             98.3        3
5 888-88-8888 F             99          2
6 111-11-1111 M             98.9        1
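To finish the de-identification asked for in the question, the grouping can be dropped and the identifier column removed. The following is a sketch under the assumption that `df` holds the question's example data; the name `deidentified` is illustrative.

```r
library(dplyr)

# Example data from the question (the object name `df` is an assumption).
df <- tibble(
  personal_id = c("111-11-1111", "999-999-999", "111-11-1111",
                  "999-999-999", "888-88-8888", "111-11-1111"),
  gender      = c("M", "F", "M", "F", "F", "M"),
  temperature = c(99.6, 98.2, 97.8, 98.3, 99.0, 98.9)
)

deidentified <- df %>%
  group_by(personal_id) %>%              # one group per person
  mutate(group_id = cur_group_id()) %>%  # integer index of the current group
  ungroup() %>%                          # grouped columns cannot be dropped by select()
  select(-personal_id)                   # remove the sensitive column

deidentified
```

The `ungroup()` step matters: while the data is still grouped by `personal_id`, `select()` keeps that column.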
tmfmnk
58

dplyr has a `group_indices()` function for creating unique group IDs:

library(dplyr)
data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
                       gender = c("M", "F", "M", "M"),
                       temperature = c(99.6, 98.2, 97.8, 95.5))

data$group_id <- data %>% group_indices(personal_id) 
data <- data %>% select(-personal_id)

data
  gender temperature group_id
1      M        99.6        1
2      F        98.2        3
3      M        97.8        2
4      M        95.5        1

Or within the same pipeline (https://github.com/tidyverse/dplyr/issues/2160):

data %>% 
    mutate(group_id = group_indices(., personal_id))
Tyler Rinker
conor
  • Unfortunately, `group_indices()` appears to automatically sort personal_id before creating the group_id, which is not always desired. – user2363777 Sep 06 '19 at 19:24
  • `group_indices()` was deprecated in dplyr 1.0.0. Please use `cur_group_id()` instead now. – sequoia May 30 '21 at 11:36
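If the sorted numbering mentioned in the first comment is unwanted, one possible workaround is to number each identifier by its order of first appearance using base R's `match()` and `unique()`. This is a sketch, not something proposed in the thread, and the vector below is made-up illustration data.

```r
# Illustration data: the first identifier seen is not the smallest one.
personal_id <- c("999-999-999", "111-11-1111", "999-999-999", "888-88-8888")

# unique() keeps values in order of first appearance; match() returns each
# element's position in that de-duplicated vector.
group_id <- match(personal_id, unique(personal_id))
group_id
# [1] 1 2 1 3
```

The same expression should also work inside `mutate()`, for example `mutate(group_id = match(personal_id, unique(personal_id)))`.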
2

Using the dplyr package:

library(dplyr)
data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
                 gender = c("M", "F", "M", "M"),
                 temperature = c(99.6, 98.2, 97.8, 95.5))

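First, extract the distinct personal_id values in order to create a unique ID:

# factor() is needed because data.frame() no longer converts strings
# to factors by default (R >= 4.0.0), so levels() alone would return NULL
cases <- data.frame(levels = levels(factor(data$personal_id)))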

Using the row names, you get a unique identifier:

cases <- cases %>%
    mutate(id = rownames(cases))

Result:

       levels id
1 111-111-111  1
2 222-222-222  2
3 999-999-999  3

Then join the cases data frame with your data:

data <- left_join(data, cases, by = c("personal_id" = "levels"))

Next, build the final ID by pasting the generated id together with the gender:

mutate(UID = paste(id, gender, sep=""))

Finally, remove personal_id and the intermediate id:

select(-personal_id, -id)

And there you go. The complete pipeline, run in place of the separate steps above:

data <- left_join(data, cases, by = c("personal_id" = "levels")) %>%
        mutate(UID = paste(id, gender, sep="")) %>%
        select(-personal_id, -id)

Result:

  gender temperature UID
1      M        99.6  1M
2      F        98.2  3F
3      M        97.8  2M
4      M        95.5  1M
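For comparison, the same UID column can be sketched in base R without the intermediate lookup table. It relies on `factor()` sorting the distinct values, as the `cases` table above does. This is an alternative illustration, not part of the original answer.

```r
data <- data.frame(
  personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
  gender      = c("M", "F", "M", "M"),
  temperature = c(99.6, 98.2, 97.8, 95.5)
)

# factor() sorts the distinct values; as.integer() returns each row's level index.
data$UID <- paste0(as.integer(factor(data$personal_id)), data$gender)

# Drop the sensitive column.
data$personal_id <- NULL
data
```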
Menelith