My goal is to replace a string with a symbol repeated as many times as the string has characters, in the same way that one can convert letters to capitals with \\U\\1. If my pattern were "...(*)...", the replacement for what is captured by (*) would be something like x\\q1 or {\\q1}x, so that I would get as many x's as there are characters captured by *.

Is this possible?

I am thinking mainly of sub and gsub, but you can answer with other libraries such as stringi or stringr. You can use perl = TRUE or perl = FALSE and any other options as convenient.

I assume the answer may be negative, since the options for the replacement seem quite limited (?gsub):

a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA. 

Main quantifiers are (?base::regex):

?
    The preceding item is optional and will be matched at most once.

*
    The preceding item will be matched zero or more times.

+
    The preceding item will be matched one or more times.

{n}
    The preceding item is matched exactly n times.

{n,}
    The preceding item is matched n or more times.

{n,m}
    The preceding item is matched at least n times, but not more than m times.

However, there seems to be a construct (not in PCRE; I am not sure whether it exists in Perl or elsewhere), (*), which captures the number of characters that the star quantifier matched (I found it at https://www.rexegg.com/regex-quantifier-capture.html), so that \q1 (same reference) could then be used to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,}, but I am not sure whether this really holds for what I am interested in.

EDIT:

As requested by commenters, I am updating my question with a specific example, taken from this interesting question and slightly modified. Let's say we have a <- "I hate extra spaces elephant". We want to keep a single space between words and the first 5 characters of each word (so far as in the original question), but then put a dot for every remaining character (I am not sure this is what the original question expects, but that does not matter). The resulting string would be "I hate extra space. eleph..." (one dot for the final s in spaces and three dots for the three letters ant at the end of elephant). I started by keeping the first 5 characters with:

gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"

How can I replace each of the characters matched by \\S* with a dot or some other symbol?

Wiktor Stribiżew
iago
  • Please show a specific problem including input and expected output. – G. Grothendieck Oct 29 '20 at 15:58
  • You could make something work using `regexpr` to identify the match position and length, and then use `substr<-` to replace it. So that might be a good way to accomplish your goal. But if your question is "do the existing regex functions have the capability" the answer is no. – Gregor Thomas Oct 29 '20 at 15:59
  • What you posted is an XY problem. There must be other ways to solve the issue, what is it, by the way? `\\L\\1` will lowercase, not uppercase the Group 1 value. I think you are asking about something like `gsub("(?:\\G(?!^)|\\()\\K[^()](?=[^()]*\\))", "x", "(888) 45 78 44", perl=TRUE)`, see https://ideone.com/etIb9S – Wiktor Stribiżew Oct 29 '20 at 15:59
  • Your question would be a better one if you edited it for tightness. I appreciate that you've done research on this, but quoting the help pages is usually much more appropriate in Answers than Questions. It would probably suffice to say "I've read about quantifiers at `?base::regex` but not found anything about using them in replacements". – Gregor Thomas Oct 29 '20 at 16:01
  • @G.Grothendieck I edited my question with specific problem – iago Oct 29 '20 at 16:17
  • @WiktorStribiżew I updated my question with X problem. – iago Oct 29 '20 at 16:17
  • @GregorThomas Thanks for the answer in the first comment. I do not understand your second comment. I updated my question with a specific example. – iago Oct 29 '20 at 16:19
  • My second comment is trying to say that your actual question starts at the paragraph "My objective...". Going on a great length *before* getting to the point and copy/pasting content from the `regex` help page doesn't add much that is useful to your question. Rather, it detracts from your question by burying the lede, and makes it less likely that people will find this question a useful resource. – Gregor Thomas Oct 29 '20 at 16:26
  • I would recommend editing your question to delete everything above the "My objective..." paragraph and instead summarize it as *"I hoped I could do this with, e.g., gsub, but reading about quantifiers at ?base::regex I didn't find anything about using them in replacements."* – Gregor Thomas Oct 29 '20 at 16:27
  • @GregorThomas You are right. I place the main paragraph to the beginning of the question, but I keep the other as I believe it is important, since what I am asking for is partially about the existence and possible use with replacement of that possibilities mentioned in the linked webpage, `(*)` and `\q1`. – iago Oct 29 '20 at 17:02

2 Answers

Quantifiers cannot be used in the replacement pattern, nor can the number of characters they matched.

What you need is a \G-based PCRE pattern that finds consecutive matches after a specific place in the string:

a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)

See the R demo and the regex demo.

Details

  • (?:\G(?!^)|(?<!\S)\S{5}) - either the end of the previous successful match, or five non-whitespace chars not preceded by a non-whitespace char
  • \K - a match reset operator that discards the text matched so far
  • \S - any non-whitespace char
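
The same pattern can be parameterized. The following is only a sketch: the helper name mask_after_n and its arguments n and mask are hypothetical, not part of the answer. It assumes mask is a plain string without backslashes, since it is passed to gsub as the replacement.

```r
# Hypothetical helper (illustration only): keep the first `n` characters of
# each whitespace-delimited word and replace every further character by `mask`.
mask_after_n <- function(x, n = 5, mask = ".") {
  # Same pattern as above, with the hard-coded 5 replaced by `n`.
  pattern <- sprintf("(?:\\G(?!^)|(?<!\\S)\\S{%d})\\K\\S", n)
  # `mask` is used as the replacement string, so it should not contain backslashes.
  gsub(pattern, mask, x, perl = TRUE)
}

a <- "I hate extra spaces elephant"
mask_after_n(a)
## [1] "I hate extra space. eleph..."
```

Because each match consumes exactly one character, the number of mask symbols always equals the number of characters beyond the first n, which is what a quantifier reference in the replacement would have provided.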
Wiktor Stribiżew

gsubfn is like gsub except that the replacement can be a function which takes the match as input and returns the replacement. The function can optionally be expressed as a formula, as we do here, replacing each string of word characters with the output of the function applied to that string. No complex regular expressions are needed.

library(gsubfn)

# Input string from the question
a <- "I hate extra spaces elephant"

gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."

Or almost the same, except that the function is slightly different:

gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."
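
If installing a package is not an option, the same idea of computing each replacement from its match can be sketched in base R with gregexpr and regmatches<-. This is a different technique from gsubfn, shown only as an illustration. The helper name truncate_with_dots is hypothetical.

```r
a <- "I hate extra spaces elephant"

# Locate every run of word characters, as "\\w+" does in the gsubfn calls.
m <- gregexpr("\\w+", a)

# Hypothetical helper: first 5 characters, then one dot per remaining character.
# pmax() is used instead of max() because `w` is a vector of all matched words.
truncate_with_dots <- function(w) {
  paste0(substr(w, 1, 5), strrep(".", pmax(0, nchar(w) - 5)))
}

# regmatches<- writes the transformed words back into their original positions.
res <- a
regmatches(res, m) <- lapply(regmatches(res, m), truncate_with_dots)
res
## [1] "I hate extra space. eleph..."
```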
G. Grothendieck