Is it possible to use gsub to replace each character of a match with another character? I have read and tried solutions from a lot of questions without success, because they were very specific to the example being used. Some that looked promising but ultimately did not get me there are
gsub-replace-regex-match-with-regex-replacement-string
replace-pattern-with-one-space-per-character-in-perl
What I am looking for is a general way to do the following. I have a list of regexes, which I combine into a single regular expression of the form
pattern <- "[0-9]{3,}|[a-z]{3,}|..."
Given a string such as
x <- "1234 abc 12 a 123456"
I would like to get back from gsub the string with each character of a match replaced by #
"#### ### 12 a ######"
instead of
"# # 12 a #"
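To make the behaviour above reproducible, here is a minimal sketch. It assumes only the two alternatives shown in the pattern (the rest of the list is elided and left out here), and the name `collapsed` is just for illustration.

```r
# Only the two alternatives shown above; the remaining ones are elided.
pattern <- "[0-9]{3,}|[a-z]{3,}"
x <- "1234 abc 12 a 123456"

# A fixed replacement string is substituted once per match,
# so every matching substring collapses to a single "#".
collapsed <- gsub(pattern, "#", x, perl = TRUE)
print(collapsed)
```

This prints "# # 12 a #", which is the output I want to avoid.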
I have used gsub with the perl arg set to TRUE, and experimented with an online regex tool, using things like \G and lookarounds, but I cannot figure it out.
The reason I am looking for a way to do this with gsub (I realise it is easy to do in other ways) is to use it as a method of censoring certain words and matches such as dates, phone numbers and email addresses in a dplyr pipeline. The function I have works fine, except that any replacement is fixed, and I would like to replace each matching character, rather than each matching substring.
filter_words <- function(.data, .words, .replacement, ...) {
  .data %>% dplyr::mutate(
    dplyr::across(
      c(...),
      ~ gsub(
        paste0("\\b", .words, collapse = "|\\b"),
        .replacement, .,
        ignore.case = TRUE, perl = TRUE
      )
    )
  )
}
I did try using a package called mgsub for the mgsub_censor function it provides. This does work, but it is several orders of magnitude slower than what I already have, so not really practical for large datasets.
I did try creating a custom gsub function able to accept a function (that could return a string consisting of the same number of characters as each match) as the replacement argument. It worked fine for a single string, but failed to work in a pipe.
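For reference, the kind of function-as-replacement helper I mean can be sketched in base R with gregexpr and regmatches<-. This is not my original attempt and it does not use gsub itself; the names `gsub_fn` and `mask_chars` are hypothetical, and the pattern again contains only the two alternatives shown earlier. Written this way it is vectorised over a character vector, which is presumably what a column inside mutate/across would need.

```r
# Hypothetical helper: like gsub(), but `replacement_fn` is a function that
# receives the character vector of matches found in one string and returns
# one replacement per match.
gsub_fn <- function(pattern, replacement_fn, x, ...) {
  m <- gregexpr(pattern, x, ...)               # match positions, one list element per string
  regmatches(x, m) <- lapply(regmatches(x, m), replacement_fn)
  x
}

# Replace every character of a match with "#".
mask_chars <- function(s) strrep("#", nchar(s))

masked <- gsub_fn(
  "[0-9]{3,}|[a-z]{3,}",
  mask_chars,
  c("1234 abc 12 a 123456", "12 a"),
  perl = TRUE
)
print(masked)
```

The first element comes back as "#### ### 12 a ######" and the second, which has no match, is returned unchanged. I have not checked how this compares in speed with plain gsub on large datasets.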