Mask phone number in R

Question

My raw data contains a lot of personal information, so I am masking it in R. The sample data and my original code are below:

install.packages("stringr")
library(stringr)

x = c("010-1234-5678",
      "John 010-8888-8888",
      "Phone: 010-1111-2222",
      "Peter 018.1111.3333",
      "Year(2007,2019,2020)",
      "Alice 01077776666")

df = data.frame(
  phoneNumber = x
)

pattern1 = "\\d{3}-\\d{4}-\\d{4}"
pattern2 = "\\d{3}.\\d{4}.\\d{4}"
pattern3 = "\\d{11}"

delPhoneList1 <- str_match_all(df$phoneNumber, pattern1) %>% unlist
delPhoneList2 <- str_match_all(df$phoneNumber, pattern2) %>% unlist
delPhoneList3 <- str_match_all(df$phoneNumber, pattern3) %>% unlist

I found three types of patterns from the dataset and each result is below:

> delPhoneList1
[1] "010-1234-5678" "010-8888-8888" "010-1111-2222"
> delPhoneList2
[1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" "007,2019,2020"
> delPhoneList3
[1] "01077776666"

Pattern1 is the typical phone number format in my country, using dashes, but some people type the number with periods, as in pattern2. However, pattern2 also matches everything that pattern1 matches, and it picks up other text as well, such as a series of years. That was unexpected.

My question is how to match exactly the pattern that I define. Pattern2 matches too much, for example "007,2019,2020" from "Year(2007,2019,2020)".

Additionally, the next step is masking the numbers using the code below:

for (phone in delPhoneList1) {
  df$phoneNumber <- gsub(phone, "010-9999-9999", df$phoneNumber)
}

I think the code works for me, but if you have a more efficient way, please let me know.

Thanks.

What is the actual masked output you want based on the 6 sample phone numbers at the top of your question? — Tim Biegeleisen, Dec 30 '21 at 00:37
Why not just `df$phoneNumber <- gsub("(\\d{3}-\\d{4}-\\d{4})|(\\d{3}\\.\\d{4}\\.\\d{4})|(\\d{11})", "010-9999-9999", df$phoneNumber)`? — ekoam, Dec 30 '21 at 00:47
@ekoam OMG. Thank you so much!!! I am a beginner in R so I did not know the efficient way to make a code. You solved my two questions, one is a pattern problem by adding \\ after `d{3}`, and the other is a short version. — Inho Lee, Dec 30 '21 at 00:57
FYI, the `.` in pattern2 is matching "any character". Instead, escape it, as in `pattern2 = "\\d{3}\\.\\d{4}\\.\\d{4}"` and it will match just periods. Or better yet, `pattern2 = "\\d{3}[-.]\\d{4}[-.]\\d{4}"` will cover both patterns 1 and 2 in one regex. — r2evans, Dec 30 '21 at 02:15

r2evans · Accepted Answer · 2021-12-30T02:26:01.840

One pattern to rule them all ;-)

ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"
grepl(ptn, x)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

The reason your pattern2 failed is that it used . as a separator, but in regex an unescaped . means "any character". You could have used \\. instead of . and it would have matched only literal periods.
I'm using placeholders here (a capture group and a back-reference): ([-.]?) captures the first separator, and \\1 ensures that the other separator is the same. If the first is a -, the second must be a -. If it's empty, then the second is empty as well, which is what allows the 11 uninterrupted digits of pattern3.
The \\b are word boundaries, ensuring that a run of 12 digits does not match:
```
grepl(ptn, c("12345678901", "123456789012"))
# [1]  TRUE FALSE
```
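
To connect this back to the masking step in the question, here is a minimal sketch that applies the same pattern with base R's gsub() instead of looping over the matches. The perl = TRUE argument is my choice for illustration (it selects the PCRE engine), and the replacement string "010-9999-9999" is simply the one used in the question.

```r
# Sample data copied from the question.
x <- c("010-1234-5678",
       "John 010-8888-8888",
       "Phone: 010-1111-2222",
       "Peter 018.1111.3333",
       "Year(2007,2019,2020)",
       "Alice 01077776666")

# Same pattern as above: optional separator captured once, reused via \\1.
ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"

# gsub() is vectorized, so one call masks every element;
# strings without a match are returned unchanged.
masked <- gsub(ptn, "010-9999-9999", x, perl = TRUE)
print(masked)
```

The "Year(2007,2019,2020)" element should come back untouched, since none of its digit runs fits the 3-4-4 shape between word boundaries.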

Since this pattern has a placeholder (a capture group), it tends to mess a little with stringr:: functions: str_match_all() returns the captured separator as an extra column next to each full match. We can work around that, depending on what you need.

For instance, if you replace the back-reference with a second instance of the same separator class, the pattern will also accept mixed separators such as 123-4444.5555, if that's not a problem:

ptn2 <- "\\b\\d{3}[-.]?\\d{4}[-.]?\\d{4}\\b"
unlist(str_match_all(x, ptn2))
# [1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" "01077776666"
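
As a quick check of the mixed-separator caveat, the sketch below compares the two patterns on a few made-up numbers. The test strings are hypothetical, and perl = TRUE is again my choice rather than a requirement.

```r
ptn  <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"       # separators must agree
ptn2 <- "\\b\\d{3}[-.]?\\d{4}[-.]?\\d{4}\\b"       # separators are independent

# Hypothetical inputs: one mixed, two consistent.
mixed <- c("123-4444.5555", "123-4444-5555", "123.4444.5555")

strict <- grepl(ptn,  mixed, perl = TRUE)
loose  <- grepl(ptn2, mixed, perl = TRUE)

print(strict)  # FALSE  TRUE  TRUE
print(loose)   #  TRUE  TRUE  TRUE
```

If mixed separators never occur in the data, either pattern gives the same result. The back-reference version is only stricter, not required.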

or we can keep the original ptn and take only the first column of str_match(), which holds the full match (the second column holds the captured separator):

unlist(str_match(x, ptn)[,1])
# [1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" NA              "01077776666"
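
If you would rather sidestep the extra capture-group column altogether, one possible alternative (not part of the approach above) is base R's regmatches() with gregexpr(), which returns only the full matches. This is a sketch under the assumption that dropping non-matching rows, rather than keeping NA, is acceptable.

```r
x <- c("010-1234-5678",
       "John 010-8888-8888",
       "Phone: 010-1111-2222",
       "Peter 018.1111.3333",
       "Year(2007,2019,2020)",
       "Alice 01077776666")

ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"

# gregexpr() locates every match per string; regmatches() extracts the
# matched text (full match only). Strings without a match contribute
# character(0), so they vanish after unlist().
hits <- unlist(regmatches(x, gregexpr(ptn, x, perl = TRUE)))
print(hits)
```

Unlike str_match(x, ptn)[,1], this does not keep an NA placeholder for the "Year(...)" row, so the result is not aligned with x.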

Thank you for your detailed answer. I didn't know the meaning of `.` as "any character". Maybe, nobody uses mixed separators I think, but your tips make me upgrade my skills in R. :D — Inho Lee, Dec 30 '21 at 04:26
FYI, if you're using `stringr` for things, you would benefit from learning at least a few more things about regex in general. https://stackoverflow.com/a/22944075/3358272 has a good list of references, demonstrating the complexity of regex. While I don't expect it to be a great tutorial (there are other great resources for that), it's a good reference to keep around. (And I *do* think a short tutorial might be informative for you, even if to cement some of what you already know with concrete examples and definitions.) — r2evans, Dec 30 '21 at 13:38
Thank you for your additional comment. The link is very useful for me, and I can learn more regular expressions. In detail, I faced a new problem to solve another matching matter, but I can understand the meaning of `\\b` thoroughly. Thanks again, and happy new year! :D — Inho Lee, Jan 03 '22 at 04:55
