4

I have a vector of variable names, var:

> var
[1] "a1" "a2" "a3" "a4"

Here is what I want to achieve: use a regex to change strings such as this:

 3*a1 + a1*a2 + 4*a3*a4 + a1*a3

to

 3a1 + a1*a2 + 4a3*a4 + a1*a3

Basically, I want to remove every "*" that is not between two values in var. Thank you in advance.

BlueFeet
  • `gsub('(\\d)\\*(\\w)', '\\1\\2', '3*a + a*b + 4*c*d + a*c')` maybe – alistaire Apr 27 '16 at 16:52
  • In your "Change `x` to `y`", neither `x` nor `y` are objects in R, so it's not clear what you're asking. – Frank Apr 27 '16 at 16:53
  • Why the negative votes? Do you have an answer for it? Or are you just being abusive while remaining anonymous? – BlueFeet Apr 27 '16 at 16:53
  • @Frank, it's a regex question as the title suggests. How to use "gsub" or other functions to achieve the goal, i.e. change ... to ... – BlueFeet Apr 27 '16 at 16:55
  • Ok, that makes sense. Maybe you're after something like: http://stackoverflow.com/questions/24173468/r-print-equation-of-linear-regression-on-the-plot-itself – Frank Apr 27 '16 at 16:58
  • Thank you @alistaire, I just modified my question to make it harder (and more realistic for my application) -- the variable names can have numbers, such as a1, a2, ... – BlueFeet Apr 27 '16 at 17:00
  • Just add a space before `\\d` (assuming it's formatted as presented; if not you can do a character range like `[ +-]` instead): `gsub('( \\d)\\*(\\w)', '\\1\\2', '3*a1 + a1*a2 + 4*a3*a4 + a1*a3')` – alistaire Apr 27 '16 at 17:01
  • @rawr, thank you for pointing it out. See my updated question -- the variable names can be anything alphanumeric (but not pure numbers.) – BlueFeet Apr 27 '16 at 17:01
  • Oops, that doesn't work with the beginning of the line. More complicated, but: `gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', '3*a1 + a1*a2 + 4*a3*a4 + a1*a3')` – alistaire Apr 27 '16 at 17:05

4 Answers

3

You can find (?<![\da-z])(\d+)\* and replace it with $1:

 (?<! [\da-z] )
 ( \d+ )                       # (1)
 \*

Or, for engines without lookbehind support, find ((?:[^\da-z]|^)\d+)\* and replace it with $1:

 (                             # (1 start)
      (?: [^\da-z] | ^ )
      \d+ 
 )                             # (1 end)
 \*

Leading assertions tend to be slower anyway, as the benchmark below suggests.
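In R, the two patterns above could be tried along the following lines. This is a minimal sketch, assuming perl = TRUE so that both the lookbehind and the \d inside the bracket expression are handled by PCRE; the names s, lookbehind_pat, and capture_pat are illustrative and do not come from the answer.

```r
s <- "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"

# Variant 1: negative lookbehind. The digits must not be preceded by
# a digit or a lowercase letter, so the "1" in "a1*" is never matched.
lookbehind_pat <- "(?<![\\da-z])(\\d+)\\*"
res1 <- gsub(lookbehind_pat, "\\1", s, perl = TRUE)

# Variant 2: no lookbehind. The preceding character (or the line start)
# is captured together with the digits and restored by \\1.
capture_pat <- "((?:[^\\da-z]|^)\\d+)\\*"
res2 <- gsub(capture_pat, "\\1", s, perl = TRUE)

print(res1)
print(res2)
```

Note that R's gsub writes the backreference as \\1 rather than $1.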

Benchmark

Regex1:   (?<![\da-z])(\d+)\*
Options:  < none >
Completed iterations:   100  /  100     ( x 1000 )
Matches found per iteration:   2
Elapsed Time:    1.09 s,   1087.84 ms,   1087844 µs


Regex2:   ((?:[^\da-z]|^)\d+)\*
Options:  < none >
Completed iterations:   100  /  100     ( x 1000 )
Matches found per iteration:   2
Elapsed Time:    0.77 s,   767.04 ms,   767042 µs
2

You can build a dynamic regex from var that matches and captures the *s sitting between two of your variables, reinserts them with a backreference in gsub, and removes all other asterisks:

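var <- c("a1", "a2", "a3", "a4")
s <- "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"
# Alternation of all variable names: a1|a2|a3|a4
block <- paste(var, collapse = "|")
# Branch 1 captures "<var>*" when another variable follows; branch 2 matches any other "*"
pat <- paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
# Group 1 is empty when branch 2 matches, so those asterisks are dropped
gsub(pat, "\\1", s, perl = TRUE)
## "3a1 + a1*a2 + 4a3*a4 + a1*a3"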

See the IDEONE demo

Here is the regex:

\b((?:a1|a2|a3|a4)\*)(?=\b(?:a1|a2|a3|a4)\b)|\*

Details:

  • \b - leading word boundary
  • ((?:a1|a2|a3|a4)\*) - Group 1, matching
    • (?:a1|a2|a3|a4) - any one of your variables
    • \* - a literal asterisk
  • (?=\b(?:a1|a2|a3|a4)\b) - a lookahead that requires one of your variables to follow (if none does, this branch fails and the * is matched by the second branch of the alternation instead)
  • | - or
  • \* - a "wild" literal asterisk to be removed (Group 1 is empty in this branch, so the replacement \1 deletes it).
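The same construction can be wrapped in a small function so that it works for any set of names. This is only a sketch: strip_coef_stars is a hypothetical helper name, and it assumes the variable names are plain alphanumeric strings with no regex metacharacters that would need escaping.

```r
# Hypothetical helper: keep "*" only where it sits between two known variables.
strip_coef_stars <- function(s, vars) {
  # Alternation of the variable names, e.g. "x1|y2"
  block <- paste(vars, collapse = "|")
  # Branch 1 captures "<var>*" followed by another variable; branch 2 is any other "*"
  pat <- paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
  # \\1 restores the captured "<var>*" and is empty for branch 2
  gsub(pat, "\\1", s, perl = TRUE)
}

out <- strip_coef_stars("2*x1 + x1*y2 + 5*y2", c("x1", "y2"))
print(out)
```

Because the decision is based on the list of names rather than on what a coefficient looks like, this approach does not depend on spacing around the operators.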
Wiktor Stribiżew
1

Taking the equation as a string, one option is

gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', '3*a1 + a1*a2 + 4*a3*a4 + a1*a3')
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"

which looks for

  • a captured group of characters, ( ... )
    • containing a non-capturing group, (?: ... )
      • containing the beginning of the line ^
      • or, |
      • a space (or \\s)
    • followed by a digit 0-9, \\d.
  • The capturing group is followed by an asterisk, \\*,
  • followed by another capturing group ( ... )
    • containing a word character (a letter, digit, or underscore), \\w.

It replaces the above with

  • the first captured group, \\1,
  • followed by the second captured group, \\2.

Adjust as necessary.
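To see what the pattern actually consumes, the matches can be listed before they are replaced. A minimal sketch, assuming perl = TRUE; the names s, pat, hits, and out are illustrative:

```r
s <- "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"
pat <- "((?:^| )\\d)\\*(\\w)"

# Each match spans: line start or a space plus one digit, the asterisk,
# and one word character.
hits <- regmatches(s, gregexpr(pat, s, perl = TRUE))[[1]]
print(hits)

# Replacing each match with groups 1 and 2 drops only the asterisk.
out <- gsub(pat, "\\1\\2", s, perl = TRUE)
print(out)
```

Only the two coefficient asterisks are matched. Those between variable names are left alone because the digit before them follows a letter rather than a space or the start of the line.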

alistaire
0

Thanks to @alistaire for offering a solution with a non-capturing group. However, that solution relies on there being a space between the coefficient and the "+" in front of it. Here's my modified version based on his suggestion:

> ss <- "3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"
# my modified version
> gsub('((?:^|\\s|\\+|\\-)\\d)\\*(\\w)', '\\1\\2', ss) 
[1] "3a1 + a1*a2+4a3*a4 +2a1*a3+ 4a2*a3"

# alistaire's
> gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4a2*a3"
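The separators could also be written as a character class, as suggested in the comments on the question. A sketch under stated assumptions: the test string ss2, the use of \\d+ to allow multi-digit coefficients, and perl = TRUE are illustrative additions and not part of the solution above.

```r
ss2 <- "12*a1 + a1*a2-3*a3*a4 +2*a1*a3"

# [-+\\s] covers minus, plus, and whitespace; \\d+ allows coefficients
# with more than one digit (an assumption beyond the original pattern).
pat <- "((?:^|[-+\\s])\\d+)\\*(\\w)"
out <- gsub(pat, "\\1\\2", ss2, perl = TRUE)
print(out)
```

The digit run still has to start right after a separator or at the line start, so the "3" in "a3*a4" is not treated as a coefficient.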
BlueFeet
  • It's a little confusing why you need to be that specific about what comes before the digit, and why you need to match anything after the asterisk at all. The usual approach is to narrow it down to what is known, i.e. what should not come before the digit. Otherwise you imply you don't trust the string, in which case you should validate the entire string before doing the replacement. –  Apr 27 '16 at 22:53
  • The user can have a variable called "4a", thus I can't simply say " should not be before the digit" – BlueFeet May 25 '16 at 21:25