MAIN QUESTION:

Is there a regex for preserving case pattern in the vein of \U and \L?
Ideally, it would also respect word boundaries and anchors.

Example

Assume we have a large body of text where we want to convert one word to another while preserving the capitalization of the word. For example, replacing all instances of "date" with "month":

 Input: `"This Date is a DATE that is daTe and date."`
Output: `"This Month is a MONTH that is moNth and month."`
 input     output
------     -------
"date" ~~> "month"
"Date" ~~> "Month"
"DATE" ~~> "MONTH"
"daTe" ~~> "moNth"   ## This example might be asking for too much.

Preserving word boundaries

I'd be interested in a solution that preserves word boundaries (i.e., is able to match whole words only). In the given example, "date" would be changed, but not "dated".
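
As a minimal illustration of the whole-word requirement (a sketch with a made-up sample string, not part of the original example), `\\b` on both sides of the pattern restricts the match to the standalone word:

```r
# Hypothetical sample text: only the standalone word should change
txt <- "date dated sedate"

# Without boundaries, every occurrence of the substring is replaced
gsub("date", "month", txt, perl = TRUE)
# [1] "month monthd semonth"

# With \\b on both sides, only the whole word "date" is replaced
gsub("\\bdate\\b", "month", txt, perl = TRUE)
# [1] "month dated sedate"
```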


Existing Workaround in R:

I currently use three nested calls to sub to accomplish this.

input <- c("date", "Date", "DATE")
expected.out <- c("month", "Month", "MONTH")

sub("date", "month", 
  sub("Date", "Month", 
    sub("DATE", "MONTH", input)
  )
)

The goal is to have a single pattern and a single replace such as

gsub("(date)", "\\Umonth", input, perl=TRUE) 

which would, ideally, yield the desired output.
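
For reference, here is a short sketch of what `\\U`, `\\L` and `\\E` do in R replacements with `perl = TRUE`: they force back-referenced text to a fixed case, rather than copying the case pattern of the match onto a new word. The sample strings are illustrative only.

```r
# \\U upper-cases the captured text that follows it in the replacement
gsub("(date)", "\\U\\1", "this date", perl = TRUE)
# [1] "this DATE"

# \\L lower-cases the captured text; \\E ends the case conversion
gsub("(DATE)(S)", "\\L\\1\\E\\2", "these DATES", perl = TRUE)
# [1] "these dateS"

# Combining both gives title case: first letter upper, the rest lower
gsub("\\b(\\w)(\\w*)", "\\U\\1\\L\\2", "hello WORLD", perl = TRUE)
# [1] "Hello World"
```

In each case the target case is fixed by the replacement string, which is why these escapes alone cannot map "Date" to "Month" and "DATE" to "MONTH" in one call.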


Notes (updated 2023)

  1. The motivation behind the question is to expand knowledge about the capabilities of RegEx. The example above is given only as an illustration. The purpose of this question is not to find alternate workarounds.
  2. The question was asked with the R tag, but I would accept answers that invoke flavors of RegEx not currently available in R.
Ricardo Saporta
  • Why not just use a map via a named vector: `map <- setNames(expected.output, input)`. Then do `month <- map[date]`. – flodel Oct 03 '14 at 00:11
  • @flodel - smart thinking - there's really no need for any regex here. – thelatemail Oct 03 '14 at 00:14
  • @flodel -- I suspect Ricardo is also wanting a solution that'll work for inputs like `input <- "Here are a date, a Date, and a DATE"` – Josh O'Brien Oct 03 '14 at 00:17
  • yes, exactly @JoshO'Brien. Flodel, I was trying to simplify the example for the sake of the question. Perhaps I oversimplified it – Ricardo Saporta Oct 03 '14 at 00:24
  • Looks like an `gsubfn::strapply` problem. Calling @G.Grothendeick. – IRTFM Oct 03 '14 at 00:28
  • My gut says you can't do it with a single regex; use a `for` loop or get fancy with a `Reduce`. – flodel Oct 03 '14 at 00:31
  • @flodel, the mapped vector probably makes the most sense if there are no alternate options. My goal however was to avoid having to create multiple capitalization-versions of the same word – Ricardo Saporta Oct 03 '14 at 03:58
  • Ricardo, is there a particular reason you left this with no accepted answers? I feel there are credible suggestions in all of them. – r2evans Apr 03 '23 at 02:43
  • @r2evans To the best of my knowledge, there is no way to do what I was asking for with RegEx. The answers given simply offer alternate workarounds. The core question is: "Is there a regex for preserving case pattern in the vein of \U and \L?" AFAIK, the answer is "No," although I have just added a bounty to the question. Thanks for the ping on this :) – Ricardo Saporta Jun 16 '23 at 20:57
  • Gotcha, thanks @RicardoSaporta – r2evans Jun 16 '23 at 21:15
  • A perl solution to a similar problem can be found [here](https://stackoverflow.com/questions/8013625/how-can-i-use-a-regular-expression-to-replace-matches-preserving-case). You might be able to extend this to your case – shs Jun 17 '23 at 13:51

6 Answers

This is one of those occasions when I think a for loop is justified:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")

for(i in seq_along(pat)) { input <- gsub(pat[i],ret[i],input) }
input
#[1] "Here are a month, a Month, and a MONTH" 
#[2] "Here are a month, a Month, and a MONTH"

And an alternative courtesy of @flodel implementing the same logic as the loop through Reduce:

Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input)

For some benchmarking of these options, see @TylerRinker's answer.

thelatemail

Here's a qdap approach. Pretty straightforward, but not the fastest:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")


library(qdap)
mgsub(pat, ret, input)

## [1] "Here are a month, a Month, and a MONTH"
## [2] "Here are a month, a Month, and a MONTH"

Benchmarking:

input <- rep("Here are a date, a Date, and a DATE",1000)

library(gsubfn)
library(microbenchmark)

(op <- microbenchmark( 
    GSUBFN = gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), 
             input, ignore.case=TRUE),
    QDAP = mgsub(pat, ret, input),
    REDUCE = Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input),
    # A braced expression, so the loop itself is timed
    # (a bare `function() {...}` would only time creating the function)
    FOR = {
       out <- input
       for(i in seq_along(pat)) { 
          out <- gsub(pat[i],ret[i],out) 
       }
       out
    },

times=100L))

## Unit: milliseconds
##    expr        min         lq     median         uq        max neval
##  GSUBFN 682.549812 815.908385 847.361883 925.385557 1186.66743   100
##    QDAP  10.499195  12.217805  13.059149  13.912157   25.77868   100
##  REDUCE   4.267602   5.184986   5.482151   5.679251   28.57819   100
##     FOR   4.244743   5.148132   5.434801   5.870518   10.28833   100

Tyler Rinker
  • I want to select this as the answer, simply for the benchmarks :) – Ricardo Saporta Oct 03 '14 at 02:13
  • The `qdap` approach is slower because it does some reordering of the patterns to make sure more/larger n character subs/replacements come first to be less likely that they will be overwritten by the smaller replacements first. If that doesn't make sense just realize there's built in protections. – Tyler Rinker Oct 03 '14 at 02:40

Using the gsubfn package, you could avoid using nested sub functions and do this in one call.

> library(gsubfn)
> x <- 'Here we have a date, a different Date, and a DATE'
> gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), x, ignore.case=T)
# [1] "Here we have a month, a different Month, and a MONTH"
hwnd
  • the replacement argument of the gsubfn() call is the list of three substitutions that depend on 'date's' capitalization. But can you explain why that is list(...) is a function that R understands as making the substitutions? Sorry if that is not clear. Maybe you could explain what that one call is doing. Thank you – lawyeR Oct 03 '14 at 01:43
  • @lawyeR - since `ignore.case=TRUE` the function matches the pattern `date` to (`date` or `Date` or `DATE`) and then looks up what the match was in the replacement `list(...)`. So, if `Date` was matched, it grabs `list(..)[["Date"]]` which is `Month` in this case. – thelatemail Oct 03 '14 at 01:57
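
The lookup mechanism described in the comment above can be sketched in base R with `regmatches<-`. This illustrates the idea only and is not how gsubfn is implemented internally; the name `lookup` is chosen here for the example.

```r
x <- "Here we have a date, a different Date, and a DATE"
lookup <- c(date = "month", Date = "Month", DATE = "MONTH")

# Locate every whole-word match, ignoring case
m <- gregexpr("\\bdate\\b", x, ignore.case = TRUE, perl = TRUE)

# Replace each match by its entry in the lookup table;
# variants missing from the table (e.g. "daTe") are left unchanged
regmatches(x, m) <- lapply(regmatches(x, m), function(hit) {
  ifelse(hit %in% names(lookup), lookup[hit], hit)
})
x
# [1] "Here we have a month, a different Month, and a MONTH"
```

The regex only finds the matches; choosing the replacement still happens in ordinary R code, which is the point the later answers make as well.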

AFAIK there is no way to do what you have asked with a pure regex and a single(*) find and replace. The problem is the replacing part can only use capturing group matches as-is - it can't process them, derive information from them or do conditionals without a function being involved. So even if you use something like \b(?:(d)|(D))(?:(a)|(A))(?:(t)|(T))(?:(e)|(E))\b in a case-sensitive find (so evenly numbered captures are upper-case and oddly numbered captures are lower-case - see "MATCH INFORMATION" in the right pane of regex101), the replace part still needs a function to act on this captured information.

(*) I am assuming you don't want to perform separate find and replaces for every single combination of uppers and lowers!

Appendix

I could stop there given you've made it very clear you aren't interested in other solutions... but just for fun I thought I'd try a JavaScript solution (which includes the function processing as part of the regex replacement):

const text = `This Date is a DATE that is daTe and date.
But dated should not be replaced, and nor should sedate.`;

const find = "date", replace = "month";
// For the general case, could apply a regex escaping function to `find` here.
// See https://stackoverflow.com/questions/3561493

const result = text.replace(new RegExp(`\\b${find}\\b`, "gi"), match => {
  let rep = "", pos = 0, upperCase = false;
  for (; pos < find.length && pos < replace.length; pos++) {
    const matchChar = match.charAt(pos);
    upperCase = matchChar.toUpperCase() === matchChar;
    const repChar = replace.charAt(pos);
    rep += upperCase ? repChar.toUpperCase() : repChar.toLowerCase();
  }
  const remaining = replace.substring(pos);
  rep += upperCase ? remaining.toUpperCase() : remaining.toLowerCase();
  return rep;
});

console.log(result);
Steve Chambers
  • This is a nice explanation of why you need some logic for the replacement - and I like that you can do this fairly elegantly in JS. FYI the snippet you posted doesn't seem to make the replacements (the first sentence of the output is still `"This Date is a DATE that is daTe and date."`) If you remove the word boundaries from the pattern it works (though obviously it then makes the unwanted replacements in the second sentence). – SamR Jun 19 '23 at 17:20
  • Oops! Bad final edit, it was working earlier - the problem was backslashes not being escaped, have now amended. – Steve Chambers Jun 19 '23 at 19:21

You have to write some logic

You will not find a pure regex solution for this. Similar SO questions in C# and JS contain extensive logical flow to determine which characters are capitals.

Furthermore, these questions have additional constraints which make them considerably simpler than your question:

  1. The pattern and the replacement are the same length.
  2. Each character in the pattern has a unique replacement character, e.g. "abcd" => "wxyz".

As a response to a similar question on the Rust reddit states:

There's a lot of possible ways that this could go wrong. For example, what should happen if you try to replace with a different number of characters ("abc" -> "wxyz")? What if you have a mapping with multiple outgoing links ("aaa" -> "xyz")?

This is precisely what you are trying to do. Where the pattern and replacement are different lengths, in general you want the index of each capital in the pattern to be mapped to the same index in the replacement, e.g. "daTe" => "moNth". However, sometimes you do not, e.g. "DATE" => "MONTH", and not "MONTh". Even if there were a regex flavour with some sort of \U equivalent (which is a nice question), regex alone cannot be enough to cope with patterns and replacements of different lengths.

Another complication is the letters in the pattern or replacement are not guaranteed to be unique: you want to be able to replace "WEEK" with "MONTH" and vice versa. This rules out character hash map approaches like the Rust answer. The Perl response linked in comments can cope with different length replacements. However, to generalise it to more than just the first letter would require a pattern setting out all possible permutations of capitals and lower case letters. This would be at least 2^n patterns, where n is the number of letters in the word being replaced. This doesn't get you much further than doing the same in R or any language.
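
To make the 2^n figure concrete, here is a short sketch that enumerates every case variant of a word. The helper name `case_variants` is made up for this illustration.

```r
case_variants <- function(word) {
  chars <- strsplit(word, "")[[1]]
  # For each character, offer its lower- and upper-case form
  options <- lapply(chars, function(ch) c(tolower(ch), toupper(ch)))
  # One row per combination of choices
  grid <- expand.grid(options, stringsAsFactors = FALSE)
  # Paste the columns together row by row
  do.call(paste0, unname(as.list(grid)))
}

v <- case_variants("date")
length(v)
# [1] 16
c("date", "Date", "DATE", "daTe") %in% v
# [1] TRUE TRUE TRUE TRUE
```

A four-letter word already needs 16 alternatives, and each extra letter doubles the count.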

R solution

I have written a function swap(), which will do this for you with two strings, even with different numbers of letters:

x <- "This Date is a DATE that is daTe and date."
swap("date", "month", x)
# [1] "This Month is a MONTH that is moNth and month."

How it works

The swap() function uses Reduce() in a pretty similar way to this answer:

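swap <- function(old, new, str, preserve_boundaries = TRUE) {
    # One c(old, new) pair per case variant actually found in `str`
    l <- create_replacement_pairs(old, new, str, preserve_boundaries)
    # Caveat: fixed = TRUE replaces every occurrence of a found variant,
    # including where that variant occurs inside a longer word.
    Reduce(\(x, l) gsub(l[1], l[2], x, fixed = TRUE), l, init = str)
}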

The workhorse function is create_replacement_pairs(), which creates a list of pairs from the case variants of the pattern that actually appear in the string, e.g. c("daTe", "DATE"), and generates replacements with the correct case, e.g. c("moNth", "MONTH"). The function logic is:

  1. Find all matches in the string, e.g. "Date" "DATE" "daTe" "date".
  2. Create a boolean mask indicating whether each letter is a capital.
  3. If all letters are capitals, the replacement should also be all caps, e.g. "DATE" => "MONTH". Otherwise make the letter at each index in the replacement a capital if the letter at the corresponding index in the pattern is a capital.

create_replacement_pairs <- function(old = "date", new = "month", str, preserve_boundaries) {
    if (preserve_boundaries) {
        pattern <- paste0("\\b", old, "\\b")
    } else {
        pattern <- old
    }

    matches <- unique(unlist(
        regmatches(str, gregexpr(pattern, str, ignore.case = TRUE))
    )) # e.g. "Date" "DATE" "daTe" "date"

    capital_shift <- lapply(matches, \(x) {
        out_length <- nchar(new)
        # Boolean mask if <= capital Z
        capitals <- utf8ToInt(x) <= 90

        # If e.g. DATE, replacement should be
        # MONTH and not MONTh
        if (all(capitals)) {
            shift <- rep(32, out_length)
        } else {
            # If not all capitals replace corresponding
            # index with capital e.g. daTe => moNth

            # Pad with lower case if replacement is longer
            length_diff <- max(out_length - nchar(old), 0)
            shift <- c(
                ifelse(capitals, 32, 0),
                rep(0, length_diff)
            )[1:out_length] # truncate if replacement shorter than pattern
        }
    })

    replacements <- lapply(capital_shift, \(x) {
        paste(vapply(
            utf8ToInt(new) - x,
            intToUtf8,
            character(1)
        ), collapse = "")
    })

    replacement_list <- Map(\(x, y) c(old = x, new = y), matches, replacements)

    replacement_list
}
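
As noted in the comments, the `utf8ToInt()` arithmetic assumes unaccented Latin letters. One possible alternative for the per-match step is to compare each character with its `toupper()` and `tolower()` forms instead. The sketch below is not the function above: `apply_case_mask` is a hypothetical name, and how non-ASCII letters are handled depends on the locale.

```r
apply_case_mask <- function(matched, new) {
  m <- strsplit(matched, "")[[1]]
  n <- strsplit(new, "")[[1]]
  # A character counts as a capital if it has distinct cases and is the upper one
  is_upper <- m == toupper(m) & m != tolower(m)
  # All capitals in the match: whole replacement in capitals (DATE => MONTH)
  if (all(is_upper)) return(toupper(new))
  # Otherwise pad the mask with FALSE or truncate it to the replacement length
  mask <- c(is_upper, rep(FALSE, max(length(n) - length(m), 0)))[seq_along(n)]
  paste(ifelse(mask, toupper(n), tolower(n)), collapse = "")
}

apply_case_mask("daTe", "month")
# [1] "moNth"
apply_case_mask("DATE", "month")
# [1] "MONTH"
apply_case_mask("Date", "day")
# [1] "Day"
```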

Use cases

This approach is not subject to the same constraints as the Rust and C# answers linked at the start of this answer. We have already seen this works where the replacement is longer than the pattern. The converse is also true:

swap("date", "day", x)
# [1] "This Day is a DAY that is daY and day."

Furthermore, as it does not use a hash map, it works in cases where the letters in the replacement are not unique.

swap("date", "week", x)
# [1] "This Week is a WEEK that is weEk and week."

It also works where the letters in the pattern are not unique:

swap("that", "which", x)
# [1] "This Date is a DATE which is daTe and date."

Edit: Thanks to @shs for pointing out in the comments that this did not preserve word boundaries. It now does by default, but you can disable this with preserve_boundaries = FALSE:

swap("date", "week", "this dAte is dated", preserve_boundaries = FALSE)
# [1] "this wEek is weekd"
swap("date", "week", "this dAte is dated")
# [1] "this wEek is dated"

Performance

Dynamically generating matches from the lower case arguments in this way will not be quite as fast as hardcoding list(c("Date", "Month"), c("DATE", "MONTH"), c("daTe", "moNth"), c("date", "month")). However a fair comparison should probably include the time it takes to type that list, which I doubt could be done in less than the ten-thousandth of a second the function takes to return, even by the most committed vim user.

I had the benefit of seeing the benchmarks in Tyler Rinker's answer so have used Reduce() and gsub(), which is the fastest of the methods for replacement tested. Additionally the approach in this answer generates pairs of exact matches and replacements, so we can set fixed = TRUE in gsub(), which with a five character pattern takes about a quarter of the time to make a replacement compared with fixed = FALSE.

This does make several passes over the string, rather than some other answers which make one pass to look for a match. However those answers then apply logic after the match is found, whereas this has a one-to-one mapping of matches to replacements, so no logic is required. I suspect which is faster depends on the data, specifically how many variants of the pattern you have, and the language (it's generally quicker in R to do the regex several times, which is written in C, rather than the capital shift logic, which is written in R).

Is this still a workaround? Yes. But as a pure regex solution cannot exist, I like a solution that abstracts away the unseemly character level iteration, so I can forget it is a bit of a hack.

SamR
  • This solution also does not respect word boundaries. Otherwise nice work though. – shs Jun 17 '23 at 19:20
  • It should also be stated that this use of `utf8ToInt()` only works for the standard latin alphabet. For example, `ÄÖÜ` in German words would be problematic. – shs Jun 17 '23 at 19:28
  • @shs thanks re `str` and `x` - fixed. Re locale, yes you're right. I thought standard Latin alphabet was implied (the other alphabets I know don't have the concept of capitals), but good point about accented letters. That can be fixed if a character set is specified (though it becomes a bit inelegant as `intToUtf8(utf8ToInt("Ä") - 32)` is not `"ä"`. Re word boundaries, I'm not sure exactly what you mean. Can you give me an example? – SamR Jun 17 '23 at 19:30
  • It's explained in the question under the heading "Preserving word boundaries". If you do `perl = T` a pattern like `"\bdate\b"` would ensure that. With the default POSIX patterns I don't know how it would work – shs Jun 17 '23 at 19:56

edit: Note that the way shown here is a single regex, single pass solution.
It avoids re-searching the string for individual separate forms.
It should represent the fastest method to do such a thing.
The point of this question is speed, which is otherwise trivial.

This is written in Perl.
It has a function that takes the find word, replace word, default replace word,
and the string to do the replacing on.

This is fairly simple.
The function generates the four forms of each word, puts them into arrays,
constructs the regex based on the find word forms, then does a string replacement
of the passed in string.

The replacement is based on the capture group that matched.
The group number is used as an index into the replacement array to fetch the
equivalent form word.

There is a default replacement passed into this function that will be used
when the find word matches only in a case-insensitive way, i.e. via the last group.

Even though done in Perl here it is easy to port to any language/regex engine.

use strict;
use warnings;


sub CreateForms{
   my ($wrd) = @_;
   my $w1 =  lc($wrd);                 # 1. lower case
   (my $w2 = $w1) =~ s/^(.)/uc($1)/e;  # 2. upper first letter only
   my $w3 =  uc($w1);                  # 3. upper case
   my $w4 = $w1;                       # 4. default (all the rest)
   my @forms = ("", $w1, $w2, $w3, $w4);
   return( @forms );
}

sub ReplaceForms{
   my ($findwrd, $replwrd, $replDefault, $input) = @_;

   my @ff = CreateForms($findwrd);
   my $Rx = "\\b(?:(" . $ff[1] . ")|(" . $ff[2] . ")|(" . $ff[3] . ")|((?i)" . $ff[4] . "))\\b";

   my @rr = CreateForms($replwrd);
   $rr[4] = $replDefault;

   $input =~ s/$Rx/ $rr[defined($1) ? 1 : defined($2) ? 2 : defined($3) ? 3 : 4]/eg;
   return $input;

};
 
print "\n";
print ReplaceForms( "date", "month", "monTh", "this is the date of the year" ), "\n";
print ReplaceForms( "date", "month", "monTh", "this is the Date of the year" ), "\n";
print ReplaceForms( "date", "month", "monTh", "this is the DATE of the year" ), "\n";
print ReplaceForms( "date", "month", "monTh", "this is the DaTe of the year" ), "\n";

Output

this is the month of the year
this is the Month of the year
this is the MONTH of the year
this is the monTh of the year
sln