I am masking phone numbers and personal names in my raw data. I already asked about the phone numbers and got an answer here.

In the case of masking personal names, I have the following code:

x = c("010-1234-5678",
      "John 010-8888-8888",
      "Phone: 010-1111-2222",
      "Peter 018.1111.3333",
      "Year(2007,2019,2020)",
      "Alice 01077776666")

df = data.frame(
  phoneNumber = x
)

delName = c("John", "Peter", "Alice")

for (name in delName) {
  df$phoneNumber <- gsub(name, "anonymous", df$phoneNumber)
}

That code produces the result I want:

> df
              phoneNumber
1           010-1234-5678
2 anonymous 010-8888-8888
3    Phone: 010-1111-2222
4 anonymous 018.1111.3333
5    Year(2007,2019,2020)
6   anonymous 01077776666

However, I have over 10,000 personal names to mask, and R is currently processing the 789th name. It will finish eventually, but I would like to know how to reduce the processing time. I looked into foreach, but I do not know how to adapt my original code above to use it.

AndrewGB
Inho Lee

2 Answers

You could try this without a loop first: paste the names together into a single pattern, separated by the regex OR operator |.

(delNamec <- paste(delName, collapse='|'))
# [1] "John|Peter|Alice"

gsub(delNamec, 'anonymous', df$phoneNumber)
# [1] "010-1234-5678"          
# [2] "anonymous 010-8888-8888"
# [3] "Phone: 010-1111-2222"   
# [4] "anonymous 018.1111.3333"
# [5] "Year(2007,2019,2020)"   
# [6] "anonymous 01077776666" 

Runs in a blink of an eye, even with 100k rows.

df2 <- df[sample(nrow(df), 1e5, replace=TRUE), , drop=FALSE]
dim(df2)
# [1] 100000      1
system.time(gsub(delNamec, 'anonymous', df2$phoneNumber))
#  user  system elapsed 
# 0.129   0.000   0.129 
jay.sf

Here is another option using stringr, which was faster than gsub in the benchmark below.

library(stringr)

str_replace_all(
  string = df$phoneNumber,
  pattern = paste(delName, collapse = '|'),
  replacement = "anonymous"
)

# [1] "010-1234-5678"          
# [2] "anonymous 010-8888-8888"
# [3] "Phone: 010-1111-2222"   
# [4] "anonymous 018.1111.3333"
# [5] "Year(2007,2019,2020)"   
# [6] "anonymous 01077776666" 

Benchmark (Thanks @jay.sf for the df2!)

delNamec <- paste(delName, collapse = '|')  # same pattern as in jay.sf's answer
df2 <- df[sample(nrow(df), 1e5, replace=TRUE), , drop=FALSE]
dim(df2)
# [1] 100000      1

bench::mark(
  stringr = str_replace_all(
    string = df2$phoneNumber,
    pattern = paste(delName, collapse = '|'),
    replacement = "anonymous"
  ),
  gsub = gsub(delNamec, 'anonymous', df2$phoneNumber)
)

# A tibble: 2 × 13
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time  
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 stringr      45.4ms   46.7ms     20.9      781KB        0    11     0      525ms
# 2 gsub           97ms  111.8ms      9.18     781KB        0     5     0      544ms
AndrewGB
  • Thank you for your answer and benchmark example, but in my case the performance of `stringr` and `gsub` does not differ significantly. I increased the data to 1e7 rows, and `stringr` and `gsub` took 3.45s and 4.08s, respectively. – Inho Lee Jan 02 '22 at 23:21
  • How can I resolve the following error message? I am considering splitting the list into chunks of 600 names, but some lists have over 13,000 names. ```Error in gsub(delNamec, "anonymous", emrMaster$cardex) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634``` – Inho Lee Jan 12 '22 at 01:24
  • @InhoLee It might depend on what is on line 634, since that's in the error message. I'm not sure that I can determine what it is without a little more context. You might also be better off posting a new question to address that specific issue. – AndrewGB Jan 12 '22 at 05:51
  • Okay, thank you for your additional comment! :D – Inho Lee Jan 12 '22 at 23:56
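
The error quoted in the comments mentions 'tre-compile.c', which suggests it is raised by R's default regex engine when the combined pattern becomes very large. One possible workaround, along the lines of the splitting idea in the comment, is to apply the names in chunks so that no single pattern grows too long. The sketch below is only an illustration: `mask_names_chunked` is a hypothetical helper, and the default chunk size of 500 is an arbitrary assumption that has not been tested against the actual limit.

```r
# Hypothetical helper: mask names in chunks so no single pattern gets too large.
mask_names_chunked <- function(text, names, chunk_size = 500,
                               replacement = "anonymous") {
  # Assign each name to a chunk holding at most chunk_size names.
  chunk_id <- ceiling(seq_along(names) / chunk_size)
  for (chunk in split(names, chunk_id)) {
    # One alternation pattern and one gsub() pass per chunk.
    pattern <- paste(chunk, collapse = "|")
    text <- gsub(pattern, replacement, text)
  }
  text
}

x <- c("John 010-8888-8888", "Phone: 010-1111-2222", "Alice 01077776666")
mask_names_chunked(x, c("John", "Peter", "Alice"), chunk_size = 2)
# [1] "anonymous 010-8888-8888" "Phone: 010-1111-2222"
# [3] "anonymous 01077776666"
```

With 13,000 names and chunks of 500 this would make 26 passes over the data instead of 13,000, so most of the speed-up from the combined pattern should be retained.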