My raw data has a lot of personal information, so I am masking them in R. The sample data and my original code are below:

install.packages("stringr")
library(stringr)

x = c("010-1234-5678",
      "John 010-8888-8888",
      "Phone: 010-1111-2222",
      "Peter 018.1111.3333",
      "Year(2007,2019,2020)",
      "Alice 01077776666")

df = data.frame(
  phoneNumber = x
)

pattern1 = "\\d{3}-\\d{4}-\\d{4}"
pattern2 = "\\d{3}.\\d{4}.\\d{4}"
pattern3 = "\\d{11}"

delPhoneList1 <- str_match_all(df$phoneNumber, pattern1) %>% unlist
delPhoneList2 <- str_match_all(df$phoneNumber, pattern2) %>% unlist
delPhoneList3 <- str_match_all(df$phoneNumber, pattern3) %>% unlist

I found three types of patterns from the dataset and each result is below:

> delPhoneList1
[1] "010-1234-5678" "010-8888-8888" "010-1111-2222"
> delPhoneList2
[1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" "007,2019,2020"
> delPhoneList3
[1] "01077776666"

Pattern1 is the typical phone number format in my country, using dashes, but some people type the number with periods instead, as in pattern2. However, pattern2 also matches everything that pattern1 matches, and it even picks up unrelated text such as a series of years. That is not the result I expected.

My question is how to match exactly the pattern that I define. Pattern2 matches too much, such as "007,2019,2020" from "Year(2007,2019,2020)".

Additionally, the next step is masking the number using the below code:

for (phone in delPhoneList1) {
  df$phoneNumber <- gsub(phone, "010-9999-9999", df$phoneNumber)
}

This code works for me, but if there is a more efficient way, please let me know.

Thanks.

Inho Lee
  • What is the actual masked output you want based on the 6 sample phone numbers at the top of your question? – Tim Biegeleisen Dec 30 '21 at 00:37
  • Why not just `df$phoneNumber <- gsub("(\\d{3}-\\d{4}-\\d{4})|(\\d{3}\\.\\d{4}\\.\\d{4})|(\\d{11})", "010-9999-9999", df$phoneNumber)`? – ekoam Dec 30 '21 at 00:47
  • @ekoam OMG. Thank you so much!!! I am a beginner in R, so I did not know the efficient way to write this. You solved both of my questions: the pattern problem, by adding `\\` before the `.` that follows `\\d{3}`, and the shorter version of the code. – Inho Lee Dec 30 '21 at 00:57
  • FYI, the `.` in pattern2 is matching "any character". Instead, escape it, as in `pattern2 = "\\d{3}\\.\\d{4}\\.\\d{4}"` and it will match just periods. Or better yet, `pattern2 = "\\d{3}[-.]\\d{4}[-.]\\d{4}"` will cover both patterns 1 and 2 in one regex. – r2evans Dec 30 '21 at 02:15

1 Answer

One pattern to rule them all ;-)

ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"
grepl(ptn, x)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
  • Your pattern2 failed because it uses . as a separator, and in regex an unescaped . means "any character". You could have used \\. instead of . and it would have matched only a literal period.

  • I'm using a capture group and a backreference here: ([-.]?) captures the first separator, and \\1 requires the second separator to be the same. If the first is a -, the second must be a - as well. If the first is empty, the second must be empty too, which is how the 11 uninterrupted digits of pattern3 are also matched.

  • The \\b are word boundaries, ensuring that a run of 12 digits does not match:

    grepl(ptn, c("12345678901", "123456789012"))
    # [1]  TRUE FALSE
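To see the backreference at work, here is a small sketch in base R. The `candidates` vector and the name `same_sep` are made up for illustration and are not from the question, and `perl = TRUE` is just my choice of regex engine for the example:

```r
# Same pattern as `ptn` above.
ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"

# Hypothetical test inputs (not from the question's data).
candidates <- c("010-1234-5678",  # dash, dash
                "010.1234.5678",  # period, period
                "01012345678",    # no separators at all
                "010-1234.5678",  # dash, then period (mixed)
                "0101234-5678")   # nothing, then dash (mixed)

# \\1 must repeat whatever ([-.]?) captured, so mixed separators fail.
same_sep <- grepl(ptn, candidates, perl = TRUE)
same_sep
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```

Only the inputs whose two separators are identical (including "both empty") are accepted.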
    

Since this pattern has a capture group, it tends to mess a little with stringr:: functions (str_match_all returns the captured separator alongside each full match), but we can work around that, depending on what you need.

For instance, you can replace the backreference with a second instance of the same separator class, which removes the capture group. This version also accepts mixed separators such as 123-4444.5555, if that's not a problem.

ptn2 <- "\\b\\d{3}[-.]?\\d{4}[-.]?\\d{4}\\b"
unlist(str_match_all(x, ptn2))
# [1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" "01077776666"  

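Or, with the original ptn, we can use str_match and keep only the first column, which holds the full match. This returns one result per element of x, with NA where nothing matches: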

unlist(str_match(x, ptn)[,1])
# [1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" NA              "01077776666"  
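For the masking step in the question, the same pattern can probably replace the loop with one vectorized call. This is a minimal sketch in base R, assuming every matched number should become the fixed string "010-9999-9999" as in the question; the names `mask` and `masked` are mine, and `perl = TRUE` is an assumption about which regex engine to use:

```r
# Sample data and pattern copied from above.
x <- c("010-1234-5678",
       "John 010-8888-8888",
       "Phone: 010-1111-2222",
       "Peter 018.1111.3333",
       "Year(2007,2019,2020)",
       "Alice 01077776666")
ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"

# Replacement value taken from the question's loop.
mask <- "010-9999-9999"

# gsub() is vectorized over x, so no loop over matches is needed.
masked <- gsub(ptn, mask, x, perl = TRUE)
writeLines(masked)
# 010-9999-9999
# John 010-9999-9999
# Phone: 010-9999-9999
# Peter 010-9999-9999
# Year(2007,2019,2020)
# Alice 010-9999-9999
```

With the data frame from the question, the equivalent would be `df$phoneNumber <- gsub(ptn, mask, df$phoneNumber, perl = TRUE)`. The "Year(...)" entry is left untouched because none of its numbers fit the pattern.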
r2evans
  • Thank you for your detailed answer. I didn't know that `.` means "any character". I think probably nobody uses mixed separators, but your tips help me upgrade my skills in R. :D – Inho Lee Dec 30 '21 at 04:26
  • FYI, if you're using `stringr` for things, you would benefit from learning at least a few more things about regex in general. https://stackoverflow.com/a/22944075/3358272 has a good list of references, demonstrating the complexity of regex. While I don't expect it to be a great tutorial (there are other great resources for that), it's a good reference to keep around. (And I *do* think a short tutorial might be informative for you, even if to cement some of what you already know with concrete examples and definitions.) – r2evans Dec 30 '21 at 13:38
  • Thank you for your additional comment. The link is very useful for me, and I can learn more regular expressions. In detail, I faced a new problem to solve another matching matter, but I can understand the meaning of `\\b` thoroughly. Thanks again, and happy new year! :D – Inho Lee Jan 03 '22 at 04:55