1

I'm trying to check whether a pattern of the form (<double num><space><an operator><space><double num>), e.g. (14.0 + 46.0), exists in a given string using a regex in R. The operator can be one of four: +, -, * and /.

There are two main patterns. The regex for the first pattern correctly identifies that the pattern exists in the string s:

#Pattern 1
s = "(14.0 + 46.0)"

#Regex
grep("^\\(-?\\d*\\.\\d{1}\\s[\\+\\-\\*\\/]\\s-?\\d*\\.\\d{1}\\)$", s)

I'm trying to find the same pattern in two different strings, s1 and s2. I modified the first regex by adding .* (any characters) to the beginning and end of the pattern ("^.* .*$"). I have checked the regex in an online checker and it works there, but it doesn't work in RStudio.

#Pattern 2
s1 = "((5.0 - 50.0) - 15.0)"
s2 = "(15.0 - (5.0 - 50.0))"

#Regex
grep("^.*\\(-?\\d*\\.\\d{1}\\s[\\+\\-\\*\\/]\\s-?\\d*\\.\\d{1}\\).*$", s1)
SriniShine
  • 2
    What does "doesn't work" mean exactly? What's the desired output? What are you even trying to match? – MrFlick Dec 07 '17 at 20:25
  • Doesn't work means it doesn't pick up that the search string s1/s2 has a pattern matching to the pattern specified in the regex. – SriniShine Dec 07 '17 at 20:32
  • But you didn't describe in words what you are trying to match. According to a different regex tester, [your expression is invalid](https://regex101.com/r/EcSyYx/2) (missing parenthesis). – MrFlick Dec 07 '17 at 20:36
  • Sorry about that. I modified the question. – SriniShine Dec 07 '17 at 20:42
  • parenthesis was a typo. My expression is correct in the "^.*\(-?\d*\.\d{1}\s[\+\-\*\/]\s-?\d*\.\d{1}\).*$" but still it's not working in R. – SriniShine Dec 07 '17 at 20:51

2 Answers

7

Brief

Just to explain why I made so many changes to your regex (I actually just rewrote it).

  1. You use {1}. While this is valid, it is redundant, so the {1} can be removed.
  2. You don't need to escape every character in a set, only specific ones (i.e. the slash and the hyphen, and the hyphen only when it's not at the start/end of the set or directly after a range, so I moved it to the start of the set).
  3. Your regex allows .1 to be valid, not sure if that was intentional, and if it was you can edit my regex to your liking. I just felt that a more correct solution would force a number before the . such that .1 is invalid, but 0.1 is valid.
  4. You have repeating parts in your pattern, so I changed these to named capture groups. This allows the pattern to be manipulated very easily to your liking. It also allows you to define each pattern part in one location instead of in multiple spots.
  5. Recursion (or balancing groups in C#) is the only way I know of to properly verify matched opening/closing pairs (in this case left and right parentheses). The g group in my pattern handles the recursion.
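
Point 3 can be checked in isolation. The sketch below compares the number part of the original regex with the number part used in this answer; the variable names and test values are illustrative only, and both patterns are anchored so that the whole string must be a number.

```r
# Number part of the original regex vs. the one used in this answer.
old_num <- "^-?\\d*\\.\\d$"              # digits before the dot are optional
new_num <- "^[-+]?\\d+(?:\\.\\d+)?$"     # at least one digit before the dot

nums <- c(".1", "0.1", "14.0", "5")

old_ok <- grepl(old_num, nums, perl = TRUE)
new_ok <- grepl(new_num, nums, perl = TRUE)

# One row per test value, one column per pattern.
print(data.frame(nums, old_ok, new_ok))
```

With these inputs, the original number pattern accepts .1 but rejects the integer 5, while the rewritten one rejects .1 and accepts 5.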

Code

See regex in use here

(?(DEFINE)
  (?<n>[-+]?\d+(?:\.\d+)?)
  (?<a>\s*[-+*\/]\s*)
  (?<g>\((?:(?&n)|(?&g))(?&a)(?:(?&n)|(?&g))\))
)
^(?&g)$

Flags: gmx

Usage

See the code in use here

r <- "(?(DEFINE)(?<n>[-+]?\\d+(?:\\.\\d+)?)(?<a>\\s*[-+*\\/]\\s*)(?<g>\\((?:(?&n)|(?&g))(?&a)(?:(?&n)|(?&g))\\)))^(?&g)$"
x <- c("(14.0 + 46.0)", "((5.0 - 50.0) - 15.0)", "(15.0 - (5.0 - 50.0))", "(15.0 - (5.0 - 50.0)")
grep(r, x, perl=TRUE)

Results

Input

(14.0 + 46.0)
((5.0 - 50.0) - 15.0)
(15.0 - (5.0 - 50.0))
(15.0 - (5.0 - 50.0)

Output

Only matches shown below.

(14.0 + 46.0)
((5.0 - 50.0) - 15.0)
(15.0 - (5.0 - 50.0))

Explanation

  • (?(DEFINE)...) Subpattern definition construct. The regex engine skips this group when matching, so nothing inside it consumes input by itself. It works like var name="value": each named subpattern can later be called by its name.
  • (?<n>[-+]?\d+(?:\.\d+)?) Subpattern n defines a valid number as follows
    • [-+]? Match zero or one of any character in the set -+
    • \d+ Match any digit one or more times
    • (?:\.\d+)? Match zero or one of a literal dot . followed by one or more digits
  • (?<a>\s*[-+*\/]\s*) Subpattern a defines a valid arithmetic operator with optional surrounding whitespace
    • \s* Match any number of whitespace characters
    • [-+*\/] Match a character in the set -+*/
    • \s* Match any number of whitespace characters
  • (?<g>\((?:(?&n)|(?&g))(?&a)(?:(?&n)|(?&g))\)) Subpattern g defines a parenthesized expression as follows
    • \( Match a literal left parenthesis (
    • (?:(?&n)|(?&g)) Match either the n pattern or the g pattern (the latter is the recursive call)
    • (?&a) Match the a pattern (a subpattern call, not recursion)
    • (?:(?&n)|(?&g)) Match either the n pattern or the g pattern (the latter is the recursive call)
    • \) Match a literal right parenthesis )
  • ^(?&g)$ Match the following
    • ^ Assert position at the start of the line
    • (?&g) Match the g pattern (a subpattern call, which recurses into itself for nested groups)
    • $ Assert position at the end of the line
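
The pattern in the Code section relies on the x (free-spacing) flag, while the Usage section collapses it onto one line. As a sketch, assuming PCRE's inline (?x) modifier (which should be available with perl = TRUE), the multi-line layout can be kept in R as well. The names r_x, x and res_x are illustrative.

```r
# (?x) tells PCRE to ignore whitespace in the pattern, so the layout
# from the Code section can be kept; backslashes are doubled for R.
r_x <- "(?x)
  (?(DEFINE)
    (?<n> [-+]?\\d+ (?: \\.\\d+ )? )
    (?<a> \\s* [-+*/] \\s* )
    (?<g> \\( (?: (?&n) | (?&g) ) (?&a) (?: (?&n) | (?&g) ) \\) )
  )
  ^ (?&g) $"

x <- c("(14.0 + 46.0)", "((5.0 - 50.0) - 15.0)",
       "(15.0 - (5.0 - 50.0))", "(15.0 - (5.0 - 50.0)")

res_x <- grepl(r_x, x, perl = TRUE)
print(res_x)
```

Each named part (n, a, g) stays on its own line, which makes it easier to change one of them without touching the rest.
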
ctwheels
  • @ctwheels Thank you for your detailed answer. It looks a lot different from mine. Do you mean that the way I wrote it is wrong and that is why it is not working? Or is this an alternative answer? – SriniShine Dec 07 '17 at 20:56
  • 1
    @SriniShine see the **Brief** section I just added to my answer – ctwheels Dec 07 '17 at 21:04
  • 1
    @ctwheels Thank you very much. Your answer is very neat! However I wanted to find out why my regex didn't work and useR provided the solution. So I accepted his solution as the answer. Thank you once again for the concise answer. – SriniShine Dec 07 '17 at 21:13
0

Note that adding .* at both ends of the regex is risky, since .* is greedy and will consume as many characters as it can. You can either append ? (which makes .* lazy) or remove .* altogether, since grep already searches for a match anywhere in the string. To answer your question, it seems that you have to turn on perl = TRUE for grep to match correctly:

#Pattern 2
s1 = "((5.0 - 50.0) - 15.0)"
s2 = "(15.0 - (5.0 - 50.0))"

grep("^.*?\\(-?\\d*\\.\\d{1}\\s[\\+\\-\\*\\/]\\s-?\\d*\\.\\d{1}\\).*?$", s1)
# integer(0)

grep("\\(-?\\d+\\.\\d\\s[\\+\\-\\*\\/]\\s-?\\d+\\.\\d\\)", s1)
# integer(0)

grep("^.*?\\(-?\\d*\\.\\d{1}\\s[\\+\\-\\*\\/]\\s-?\\d*\\.\\d{1}\\).*?$", s1, perl = TRUE)
# [1] 1

grep("\\(-?\\d+\\.\\d\\s[\\+\\-\\*\\/]\\s-?\\d+\\.\\d\\)", s1, perl = TRUE)
# [1] 1

As pointed out by @ctwheels, this will match any string that has the pattern (<double num><space><an operator><space><double num>) in any position of the string. So this does not serve to validate whether the string contains only valid characters. See @ctwheels's answer for the latter case.
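
One possible explanation for why the first pattern matched without perl = TRUE while the second did not: as far as I can tell from R's ?regex help page, escaping a hyphen with a backslash inside a character class is only supported with perl = TRUE, and s uses + while s1 and s2 use -. The sketch below assumes that reading and simply places the hyphen last in the class, where it is literal in both engines; pat and the result names are illustrative.

```r
s  <- "(14.0 + 46.0)"
s1 <- "((5.0 - 50.0) - 15.0)"
s2 <- "(15.0 - (5.0 - 50.0))"

# Hyphen placed last in the class, so no backslash escape is needed.
pat <- "\\(-?\\d+\\.\\d\\s[+*/-]\\s-?\\d+\\.\\d\\)"

default_engine <- grepl(pat, c(s, s1, s2))               # default regex engine
perl_engine    <- grepl(pat, c(s, s1, s2), perl = TRUE)  # PCRE

print(default_engine)
print(perl_engine)
```

If the assumption holds, both calls report a match for all three strings, so perl = TRUE is no longer required for this existence check.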

acylam
  • Thank you for your answer. But the previous one (regex for pattern 1) worked without turning on perl though. – SriniShine Dec 07 '17 at 21:05
  • @ctwheels the one with .* is my solution. The regex worked in the expression checker but not in R. But when I turned on perl it did work. – SriniShine Dec 07 '17 at 21:21
  • @ctwheels I wanted to see whether the pattern "()" exists in a given string. I wanted to find out why the regex I wrote (I admit it is not the most efficient one) did work in the regex evaluator but not in R. – SriniShine Dec 07 '17 at 22:04
  • 2
    This answer ensures that a pattern of the form `(1.1 - 1.1)` exists in the string. The pattern does not validate nesting of this pattern, but simply **checks its existence** in a given string. If that is the intention, use this answer. If the intention is to validate the string, do not use this answer. For validation or nested validation see [my answer](https://stackoverflow.com/a/47703242/3600709) instead. – ctwheels Dec 07 '17 at 22:06