
Is it possible to use gsub to replace each character of a match with another character? I have read and tried solutions from a lot of questions without success, because they were very specific to the example being used. Some that looked promising but ultimately did not get me there are

gsub-replace-regex-match-with-regex-replacement-string

replace-pattern-with-one-space-per-character-in-perl

What I am looking for is a general way to do the following. I have a list of regexes, which I combine into a single regular expression of the form

pattern <- "[0-9]{3,}|[a-z]{3,}|..."

Given a string such as

x <- "1234 abc 12 a 123456"

I would like to get back from gsub the string with each character of a match replaced by #

"#### ### 12 a ######"

instead of

"# # 12 a #"

I have used gsub with the perl arg set to TRUE, and experimented with an online regex tool, using things like \G and lookarounds, but I cannot figure it out.

The reason I am looking for a way to do this with gsub (I realise it is easy to do in other ways) is to use it as a method of censoring certain words and matches such as dates, phone numbers and email addresses in a dplyr pipeline. The function I have works fine, except that any replacement is fixed, and I would like to replace each matching character, rather than each matching substring.

filter_words <- function(.data, .words, .replacement, ...) {
  .data %>% dplyr::mutate(
    dplyr::across(
      c(...),
      ~ gsub(
          paste0("\\b", .words, collapse = "|\\b"),
          .replacement, .,
          ignore.case = TRUE, perl = TRUE
      )
    )
  )
}

I did try using a package called mgsub for the mgsub_censor function it provides. This does work, but it is several orders of magnitude slower than what I already have, so not really practical for large datasets.

I did try creating a custom gsub function able to accept a function (that could return a string consisting of the same number of characters as each match) as the replacement argument. It worked fine for a single string, but failed to work in a pipe.

msm1089
  • I tagged with perl since I am using gsub with perl=TRUE. – msm1089 Jan 29 '22 at 06:11
  • You need to state, at the beginning, the rule you are using for replacements. The only evidence for that is that you want to convert `"1234 abc 12 a 123456"` to `"#### ### 12 a ######"`. I can think of many rules that would achieve that. Two are the following: 1) convert every character in a string of 3 or more characters other than spaces to `'#'`; 2) convert every character in a string of 3 or more letters or 4 or more digits to `'#'`. Please edit to clarify. – Cary Swoveland Jan 29 '22 at 07:06
  • @CarySwoveland - I think it is clear from the initial question: "...replace each character of a match with another character?". I have added what I want to get to the example anyway. – msm1089 Jan 29 '22 at 08:07

1 Answer


You can pass a function to str_replace_all as the replacement; it is called on each match, so strrep can repeat the # symbol once per matched character.

x <- "1234 abc 12 a 123456"
pattern <- "[0-9]{3,}|[a-z]{3,}"

stringr::str_replace_all(x, pattern, function(m) strrep('#', nchar(m)))
#[1] "#### ### 12 a ######"
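
If you would rather stay in base R, a similar effect can probably be had with gregexpr and the replacement form of regmatches. This is only a sketch: `mask_matches` and its `mask` argument are names made up for illustration, and it assumes `mask` is a single character.

```r
# Hypothetical helper (not part of base R or any package): replaces every
# character of each regex match with `mask`, assumed to be one character.
mask_matches <- function(x, pattern, mask = "#", ignore.case = FALSE) {
  # Locate all matches in each element of x
  m <- gregexpr(pattern, x, perl = TRUE, ignore.case = ignore.case)
  # Build one replacement per match, as long as the match itself
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(hit) strrep(mask, nchar(hit))
  )
  x
}

x <- "1234 abc 12 a 123456"
pattern <- "[0-9]{3,}|[a-z]{3,}"

mask_matches(x, pattern)
#[1] "#### ### 12 a ######"
```

Because the helper takes and returns a character vector, it should be usable in place of the gsub call inside dplyr::across, though I have not timed it against large data.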
Ronak Shah