You have to write some logic
You will not find a pure regex solution for this. Similar SO questions in C# and JS contain extensive logical flow to determine which characters are capitals.
Furthermore, these questions have additional constraints which make them considerably simpler than your question:
- The pattern and the replacement are the same length.
- Each character in the pattern has a unique replacement character, e.g. "abcd" => "wxyz".
As a response to a similar question on the Rust subreddit puts it:
There's a lot of possible ways that this could go wrong. For example, what should happen if you try to replace with a different number of characters ("abc" -> "wxyz")? What if you have a mapping with multiple outgoing links ("aaa" -> "xyz")?
This is precisely what you are trying to do. Where the pattern and replacement have different lengths, you generally want the index of each capital in the pattern to be mapped to the same index in the replacement, e.g. "daTe" => "moNth". However, sometimes you do not, e.g. "DATE" => "MONTH", and not "MONTh". Even if there were a regex flavour with some sort of \U equivalent (which is a nice question), regex alone cannot cope with patterns and replacements of different lengths.
Another complication is that the letters in the pattern or replacement are not guaranteed to be unique: you want to be able to replace "WEEK" with "MONTH" and vice versa. This rules out character hash map approaches like the Rust answer. The Perl response linked in the comments can cope with different length replacements. However, generalising it beyond the first letter would require a pattern for every possible combination of capital and lower case letters. This would be at least 2^n patterns, where n is the number of letters in the word being replaced. This doesn't get you much further than doing the same in R or any other language.
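To get a feel for how quickly that grows, here is a minimal sketch that enumerates every case variant of a word. The helper case_variants() is hypothetical and not part of the solution below, and it assumes the word contains ASCII letters only.

```r
# Hypothetical helper: enumerate every upper/lower case variant of a word.
case_variants <- function(word) {
  chars <- strsplit(word, "")[[1]]
  # Each letter contributes two options: lower case and upper case
  options <- lapply(chars, \(ch) c(tolower(ch), toupper(ch)))
  grid <- expand.grid(options, stringsAsFactors = FALSE)
  # Paste the columns back together, one string per row of the grid
  do.call(paste0, unname(as.list(grid)))
}

variants <- case_variants("date")
length(variants)
# [1] 16
```

A four letter word already needs 2^4 = 16 patterns, and each extra letter doubles the count.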
R solution
I have written a function, swap(), which will do this for you with two strings, even with different numbers of letters:
x <- "This Date is a DATE that is daTe and date."
swap("date", "month", x)
# [1] "This Month is a MONTH that is moNth and month."
How it works
The swap() function uses Reduce() in a pretty similar way to this answer:
swap <- function(old, new, str, preserve_boundaries = TRUE) {
  l <- create_replacement_pairs(old, new, str, preserve_boundaries)
  Reduce(\(x, l) gsub(l[1], l[2], x, fixed = TRUE), l, init = str)
}
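To make the Reduce() step concrete, here is a small sketch with hand-written pairs standing in for the output of create_replacement_pairs(). The pairs and the input string are made up for illustration.

```r
# Hand-written pairs standing in for the output of create_replacement_pairs()
pairs <- list(c("Date", "Month"), c("DATE", "MONTH"))

# Reduce() threads the string through one gsub() call per pair,
# feeding the result of each replacement into the next
result <- Reduce(
  \(x, p) gsub(p[1], p[2], x, fixed = TRUE),
  pairs,
  "A Date and a DATE"
)
result
# [1] "A Month and a MONTH"
```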
The workhorse function is create_replacement_pairs(), which creates a list of pairs: each variant of the pattern that actually appears in the string, e.g. c("daTe", "DATE"), alongside a replacement with the correct case, e.g. c("moNth", "MONTH"). The function logic is:
- Find all matches in the string, e.g. "Date" "DATE" "daTe" "date".
- Create a boolean mask indicating whether each letter is a capital.
- If all letters are capitals, the replacement should also be all caps, e.g. "DATE" => "MONTH". Otherwise make the letter at each index in the replacement a capital if the letter at the corresponding index in the pattern is a capital.
create_replacement_pairs <- function(old = "date", new = "month", str, preserve_boundaries) {
  # The case shift below assumes old and new are lower case ASCII letters
  if (preserve_boundaries) {
    pattern <- paste0("\\b", old, "\\b")
  } else {
    pattern <- old
  }

  matches <- unique(unlist(
    regmatches(str, gregexpr(pattern, str, ignore.case = TRUE))
  )) # e.g. "Date" "DATE" "daTe" "date"

  capital_shift <- lapply(matches, \(x) {
    out_length <- nchar(new)
    # Boolean mask: TRUE where the code point is <= 90 (capital Z)
    capitals <- utf8ToInt(x) <= 90
    if (all(capitals)) {
      # If e.g. DATE, replacement should be
      # MONTH and not MONTh
      shift <- rep(32, out_length)
    } else {
      # If not all capitals replace corresponding
      # index with capital e.g. daTe => moNth
      # Pad with lower case if replacement is longer
      length_diff <- max(out_length - nchar(old), 0)
      shift <- c(
        ifelse(capitals, 32, 0),
        rep(0, length_diff)
      )[seq_len(out_length)] # truncate if replacement shorter than pattern
    }
    shift
  })

  # Subtracting 32 from a lower case ASCII code point gives the capital
  replacements <- lapply(capital_shift, \(x) {
    paste(vapply(
      utf8ToInt(new) - x,
      intToUtf8,
      character(1)
    ), collapse = "")
  })

  replacement_list <- Map(\(x, y) c(old = x, new = y), matches, replacements)
  replacement_list
}
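The mask arithmetic is easier to follow on a single match. This is a minimal trace of the same steps for "daTe" => "moNth", with the padding written out by hand rather than computed.

```r
match <- "daTe"
new <- "month"

# TRUE where the letter is a capital: FALSE FALSE TRUE FALSE
capitals <- utf8ToInt(match) <= 90

# Pad with one zero because "month" is one letter longer than "daTe":
# 0 0 32 0 0
shift <- c(ifelse(capitals, 32, 0), 0)

# Upper and lower case ASCII letters are 32 code points apart
result <- intToUtf8(utf8ToInt(new) - shift)
result
# [1] "moNth"
```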
Use cases
This approach is not subject to the same constraints as the Rust and C# answers linked at the start of this answer. We have already seen this works where the replacement is longer than the pattern. The converse is also true:
swap("date", "day", x)
# [1] "This Day is a DAY that is daY and day."
Furthermore, as it does not use a hash map, it works in cases where the letters in the replacement are not unique.
swap("date", "week", x)
# [1] "This Week is a WEEK that is weEk and week."
It also works where the letters in the pattern are not unique:
swap("that", "which", x)
# [1] "This Date is a DATE which is daTe and date."
Edit: Thanks to @shs for pointing out in the comments that this did not preserve word boundaries. It now does by default, but you can disable this with preserve_boundaries = FALSE:
swap("date", "week", "this dAte is dated", preserve_boundaries = FALSE)
# [1] "this wEek is weekd"
swap("date", "week", "this dAte is dated")
# [1] "this wEek is dated"
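One thing worth checking, assuming the definitions above: the boundary check applies when the matches are collected, but the final gsub() call uses fixed = TRUE, which replaces the literal text wherever it occurs. So if lower case "date" is matched as a whole word, the same text inside "dated" looks like it would be replaced too. A hedged sketch of the difference, using a hand-written pair rather than the functions above:

```r
pairs <- list(c("date", "week"))
str <- "this date is dated"

# fixed = TRUE replaces the literal text anywhere, including inside "dated"
literal <- Reduce(\(x, p) gsub(p[1], p[2], x, fixed = TRUE), pairs, str)
literal
# [1] "this week is weekd"

# Wrapping each exact variant in \\b keeps the boundary at replacement time
bounded <- Reduce(
  \(x, p) gsub(paste0("\\b", p[1], "\\b"), p[2], x),
  pairs,
  str
)
bounded
# [1] "this week is dated"
```

The regex form gives up the fixed = TRUE speed-up discussed under Performance, so which to use depends on whether this case can occur in your data.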
Performance
Dynamically generating matches from the lower case arguments in this way will not be quite as fast as hardcoding list(c("Date", "Month"), c("DATE", "MONTH"), c("daTe", "moNth"), c("date", "month")). However, a fair comparison should probably include the time it takes to type that list, which I doubt could be done in less than the ten-thousandth of a second the function takes to return, even by the most committed vim user.
I had the benefit of seeing the benchmarks in Tyler Rinker's answer, so I have used Reduce() and gsub(), which is the fastest of the replacement methods tested. Additionally, the approach in this answer generates pairs of exact matches and replacements, so we can set fixed = TRUE in gsub(), which with a five character pattern takes about a quarter of the time to make a replacement compared with fixed = FALSE.
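If you want to check the fixed = TRUE difference on your own machine, a rough sketch with base R's system.time() is below. The input vector and its size are arbitrary choices for illustration, and the timings will vary, so none are shown.

```r
# Arbitrary test data: the same sentence repeated many times
x <- rep("This Month is a MONTH that is moNth and month.", 1e4)

# Same five character pattern, literal matching versus regex matching
time_fixed <- system.time(res_fixed <- gsub("month", "dates", x, fixed = TRUE))
time_regex <- system.time(res_regex <- gsub("month", "dates", x, fixed = FALSE))

# Both calls should produce the same result for a plain literal pattern
identical(res_fixed, res_regex)
# [1] TRUE

# Compare elapsed times
time_fixed["elapsed"]
time_regex["elapsed"]
```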
This does make several passes over the string, unlike some other answers, which make one pass to look for a match. However, those answers then apply logic after each match is found, whereas this has a one-to-one mapping of matches to replacements, so no logic is required at replacement time. I suspect which is faster depends on the data, specifically how many variants of the pattern you have, and on the language (in R it is generally quicker to run the regex several times, as the regex engine is written in C, than to run the capital shift logic, which is written in R).
Is this still a workaround? Yes. But as a pure regex solution cannot exist, I like a solution that abstracts away the unseemly character level iteration, so I can forget it is a bit of a hack.