  • I have a dataframe with 8 unique values

     data<-data.frame(id=c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"))
     >data
           id
        1  ab
        2  cc
        3  cc
        4  dd
        5  ee
        6  ff
        7  ee
        8  ff
        9  ab
        10 dd
        11 gg
        12 1
        13 air 
    
  • I create another dataframe holding 8 unique values that are to be used as replacements

     library(random)
     replacements<-data.frame(value=randomStrings(n=8, len=2, digits=FALSE,loweralpha=TRUE, unique=TRUE, check=TRUE))
     replacements 
      V1
     1 SJ
     2 fH
     3 TZ
     4 Mr
     5 oZ
     6 kZ
     7 fe
     8 ql
    
  • I want to replace each unique value in the data dataframe with the corresponding value from the replacements dataframe, as follows:

All ab values replaced by SJ
All cc values replaced by fH
All dd values replaced by TZ
All ee values replaced by Mr
All ff values replaced by oZ
All gg values replaced by kZ
All 1 values replaced by fe
All air values replaced by ql

  • Currently, I am achieving this by:

        library(random)
        data<-data.frame(id=c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"))
        data$id<-as.character(data$id)
        replacements<-data.frame(value=randomStrings(n=8, len=2, digits=FALSE,loweralpha=TRUE, unique=TRUE, check=TRUE))
        replacements$V1<-as.character(replacements$V1)
        # Snapshot the ids before the loop: data$id[i] would read rows that were already overwritten
        original<-data$id
        ids<-unique(original)
        for(i in seq_along(ids)){
             data$id[original %in% ids[i]] <- replacements$V1[i]
        }
    
    
        >data
           id
        1  SJ
        2  fH
        3  fH
        4  TZ
        5  Mr
        6  oZ
        7  Mr
        8  oZ
        9  SJ
        10 TZ
        11 kZ
        12 fe
        13 ql
    
  • Is there a base R function that achieves this? Is there a better approach for masking data?

Akki

2 Answers


I would suggest using merge(). To do that, you first need to add a column holding the unique values of data$id to replacements, because both data frames must share a column to merge on.

Here's data:

> data
    id
1   ab
2   cc
3   cc
4   dd
5   ee
6   ff
7   ee
8   ff
9   ab
10  dd
11  gg
12   1
13 air

Here's replacements:

> replacements
  V1
1 VS
2 Of
3 bH
4 iJ
5 jm
6 kH
7 cm
8 rQ

So add unique data$id to replacements:

replacements$id <- unique(data$id)

Giving:

  V1  id
1 VS  ab
2 Of  cc
3 bH  dd
4 iJ  ee
5 jm  ff
6 kH  gg
7 cm   1
8 rQ air

Then merge data with replacements using id:

data <- merge(data, replacements, by = "id", all.x = TRUE, sort = FALSE)

Giving (the rows are now grouped by id, not in their original order):

    id V1
1   ab VS
2   ab VS
3   cc Of
4   cc Of
5   dd bH
6   dd bH
7   ee iJ
8   ee iJ
9   ff jm
10  ff jm
11  gg kH
12   1 cm
13 air rQ

If you really wanted to keep only the new id column, you could drop the original id and rename the new column:

data <- data[, 2, drop = FALSE]
colnames(data) <- "id"

Giving:

   id
1  VS
2  VS
3  Of
4  Of
5  bH
6  bH
7  iJ
8  iJ
9  jm
10 jm
11 kH
12 cm
13 rQ
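If the original row order has to be preserved, a lookup with base R's match() is one alternative to merge(). This is a minimal sketch, not part of the steps above; it hard-codes the replacement values printed earlier in this answer, and the name masked is only illustrative:

```r
# Same example data as in the question (kept as character, not factor)
data <- data.frame(
  id = c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"),
  stringsAsFactors = FALSE
)

# Replacement values copied from the output above (normally random)
replacements <- data.frame(
  V1 = c("VS","Of","bH","iJ","jm","kH","cm","rQ"),
  stringsAsFactors = FALSE
)
replacements$id <- unique(data$id)

# match() gives, for each row of data, the position of its id in the
# lookup table, so the result keeps the original row order
masked <- replacements$V1[match(data$id, replacements$id)]
masked
```

Assigning the result back with data$id <- masked would overwrite the column in place.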
Stuart Allen
  • Masking data using the CRC32 hash algorithm

    library(data.table)
    library(digest)
    data<-data.frame(id=c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"), stringsAsFactors=FALSE)
    setDT(data)
    
    # Hash each unique value once, then map every element to its hash by name
    anonymize <- function(x, algo="crc32"){
        unq_hashes <- vapply(unique(x), function(object) digest(object, algo=algo), FUN.VALUE="", USE.NAMES=TRUE)
        unname(unq_hashes[x])
    }
    
    cols_to_mask <- c("id")
    # Parentheses make := treat cols_to_mask as a vector of column names
    data[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols=cols_to_mask]
    

Reference: Data anonymization in R
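For a plain data.frame, the same idea can be applied without data.table. The following is only a sketch and assumes the digest package is installed; the hash_ids name is hypothetical, and the hash strings themselves are not shown because they depend on how digest() serializes each value:

```r
library(digest)

data <- data.frame(
  id = c("ab","cc","cc","dd","ee","ff","ee","ff","ab","dd","gg",1,"air"),
  stringsAsFactors = FALSE
)

# Hash each distinct value once, then look every row up by name
hash_ids <- function(x, algo = "crc32") {
  x <- as.character(x)
  hashes <- vapply(unique(x), digest, FUN.VALUE = "", algo = algo)
  unname(hashes[x])
}

data$id <- hash_ids(data$id)

# Equal inputs map to equal hashes, so duplicates stay recognisable
data$id[1] == data$id[9]
```

Unlike the random replacements in the question, hashing is deterministic, so rerunning it on the same ids gives the same masked values.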

Akki