15

I have a math expression, for example:

((2-x+3)^2+(x-5+7)^10)^0.5

I need to replace the ^ symbol to pow function of C language. I think that regex is what I need, but I don't know a regex like a pro. So I ended up with this regex:

(\([^()]*)*(\s*\([^()]*\)\s*)+([^()]*\))*

I don't know how to improve this. Can you advice me something to solve that problem?

The expected output:

pow(pow(2-x+3,2)+pow(x-5+7,10),0.5)
M--
  • 25,431
  • 8
  • 61
  • 93
Kalinkin Alexey
  • 175
  • 1
  • 8
  • What is the final expected output? – Wiktor Stribiżew Nov 15 '16 at 09:52
  • 16
    Regex isn't a good choice here, because it doesn't handle nested parentheses well. Really, this is the job for a parser (at least in the general case). – Tim Biegeleisen Nov 15 '16 at 09:56
  • I don't work with parser in R to this day, I think. What parser do you recommend? – Kalinkin Alexey Nov 15 '16 at 10:11
  • 3
    @Алексей, Please specify if you need to deal with *any* expressions so that I could precise my answer, or you could select a better answer as accepted if you think they meet your requirements more, or at least let future readers know why this answer is accepted and not another. E.g. should `x^2` be turned to `pow(x, 2)`? Should the `y^x` be handled, too? – Wiktor Stribiżew Nov 15 '16 at 15:08
  • Or do you have a requirement - as I understood it from the start - to only process `^` that have parentheses before it? – Wiktor Stribiżew Nov 15 '16 at 15:16
  • Oh, so many answers! Well, my job is to convert the ^ to pow. And I need to place the chunk of code inside my function. For the first I think about regex, because I don't see any solution(to this time) on R code. But your answer I think is handling all situations that may be occur inside my parsing file. – Kalinkin Alexey Nov 15 '16 at 17:16
  • 2
    "First you have a problem. When you find a regex for it, then you have two problems". – Bradley Thomas Nov 15 '16 at 18:32
  • I think you should not use a regex in this case. Treat my answer only as an educational piece of information that one may need in case specific contexts should be handled. – Wiktor Stribiżew Nov 17 '16 at 13:50

4 Answers4

18

One of the most fantastic things about R is that you can easily manipulate R expressions with R. Here, we recursively traverse your expression and replace all instances of ^ with pow:

f <- function(x) {
  if(is.call(x)) {
    if(identical(x[[1L]], as.name("^"))) x[[1L]] <- as.name("pow")
    if(length(x) > 1L) x[2L:length(x)] <- lapply(x[2L:length(x)], f)
  }
  x
}
f(quote(((2-x+3)^2+(x-5+7)^10)^0.5))

## pow((pow((2 - x + 3), 2) + pow((x - 5 + 7), 10)), 0.5)

This should be more robust than the regex since you are relying on the natural interpretation of the R language rather than on text patterns that may or may not be comprehensive.


Details: Calls in R are stored in list like structures with the function / operator at the head of the list, and the arguments in following elements. For example, consider:

exp <- quote(x ^ 2)
exp
## x^2
is.call(exp)
## [1] TRUE

We can examine the underlying structure of the call with as.list:

str(as.list(exp))
## List of 3
##  $ : symbol ^
##  $ : symbol x
##  $ : num 2

As you can see, the first element is the function/operator, and subsequent elements are the arguments to the function.

So, in our recursive function, we:

  • Check if an object is a call
    • If yes: check if it is a call to the ^ function/operator by looking at the first element in the call with identical(x[[1L]], as.name("^")
      • If yes: replace the first element with as.name("pow")
      • Then, irrespective of whether this was a call to ^ or anything else:
        • if the call has additional elements, cycle through them and apply this function (i.e. recurse) to each element, replacing the result back into the original call (x[2L:length(x)] <- lapply(x[2L:length(x)], f))
    • If no: just return the object unchanged

Note that calls often contain the names of functions as the first element. You can create those names with as.name. Names are also referenced as "symbols" in R (hence the output of str).

BrodieG
  • 51,669
  • 9
  • 93
  • 146
14

DISCLAIMER: The answer was written with the OP original regex in mind, when the question sounded as "process the ^ preceded with balanced (nested) parentheses". Please do not use this solution for generic math expression parsing, only for educational purposes and only when you really need to process some text in the balanced parentheses context.

Since a PCRE regex can match nested parentheses, it is possible to achieve in R with a mere regex in a while loop checking the presence of ^ in the modified string with x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE). Once there is no ^, there is nothing else to substitute.

The regex pattern is

(\(((?:[^()]++|(?1))*)\))\^(\d*\.?\d+)

See the regex demo

Details:

  • (\(((?:[^()]++|(?1))*)\)) - Group 1: a (...) substring with balanced parentheses capturing what is inside the outer parentheses into Group 2 (with ((?:[^()]++|(?1))*) subpattern) (explanation can be found at How can I match nested brackets using regex?), in short, \ matches an outer (, then (?:[^()]++|(?1))* matches zero or more sequences of 1+ chars other than ( and ) or the whole Group 1 subpattern ((?1) is a subroutine call) and then a ))
  • \^ - a ^ caret
  • (\d*\.?\d+) - Group 3: an int/float number (.5, 1.5, 345)

The replacement pattern contains a literal pow() and the \\2 and \\3 are backreferences to the substrings captured with Group 2 and 3.

R code:

v <- "((2-x+3)^2+(x-5+7)^10)^0.5"
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
while(x) {
    v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", "pow(\\2, \\3)", v, perl=TRUE);
    x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
}
v
## => [1] "pow(pow(2-x+3, 2)+pow(x-5+7, 10), 0.5)"

And to support ^(x-3) pows, you may use

 v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(?|()(\\d*\\.?\\d+)|(\\((‌​(?:[^()]++|(?3))*)\\‌​)))", "pow(\\2, \\4)", v, perl=TRUE);

and to check if there are any more values to replace:

x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(?|()(\\d*\\.?\\d+)|(\\((‌​(?:[^()]++|(?3))*)\\‌​)))", v, perl=TRUE)
Community
  • 1
  • 1
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Hoho, I think this is what I need! Thanks! – Kalinkin Alexey Nov 15 '16 at 10:17
  • An alternative is to write a parser, but people usually forget that a regex engine is a parser as well, and modern regex engines are able to match non-regular grammar, e.g. balanced nested parentheses. – Wiktor Stribiżew Nov 15 '16 at 10:19
  • Hm, and what will be our regex looking if third group is have a parentheses? I mean if the input will be something like this: `(2-x+5)^(x-5)` – Kalinkin Alexey Nov 15 '16 at 10:50
  • Then a question is: can there be again nested parentheses? If yes, you will have to add a branch reset and an empty group "hack": `v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(?|()(\\d*\\.?\\d+)|(\\(((?:[^()]++|(?3))*)\\)))", "pow(\\2, \\4)", v, perl=TRUE);`. See [the regex demo](https://regex101.com/r/eelDus/2) and a [R demo](https://ideone.com/kRQbXN). – Wiktor Stribiżew Nov 15 '16 at 11:10
  • 1
    This appears to create an infinite loop with `v <- "((2-x^2+3)^2+(x-5+7)^10)^0.5"` (note `x^2`). I think a regex based solution here will inevitably run into corner cases that are hard to anticipate. – BrodieG Nov 15 '16 at 14:52
  • @BrodieG: No, it can easily be fixed with a `grepl` check fix. I added the fix for that. There cannot be any "edge" cases *if* the texts are always in the stated format. – Wiktor Stribiżew Nov 15 '16 at 14:53
  • But in that case it doesn't replace the `x^2`, right, or am I missing something? – BrodieG Nov 15 '16 at 15:03
  • @BrodieG: I understood OP wanted to convert to `pow` only those that are inside parentheses. I deduced it after reading `(\([^()]*)*(\s*\([^()]*\)\s*)+([^()]*\))*` expression where `(....)` part is *obligatory*. Again, it can be easily updated. I also did not consider cases like `^x`. Let OP clarify. I asked a question below the original post. – Wiktor Stribiżew Nov 15 '16 at 15:09
  • 5
    Hacking our way to the answer is advantageous when there isn't a clear-cut path, but in this case, R has a definitive method for accomplishing this. – Pierre L Nov 15 '16 at 15:12
  • @PierreLafortune: Unless there is a requirement: only when there are parentheses before `^`. – Wiktor Stribiżew Nov 15 '16 at 15:13
  • 1
    I am not doubting whether the proposed solution works or not, I'm referring to using our knowledge of R to get to the simplest answer that works with the language instead of a side-door hack into it. – Pierre L Nov 15 '16 at 15:16
  • @PierreLafortune: I do not insist, if OP claims it is not the solution and unaccepts, I will remove the answer. – Wiktor Stribiżew Nov 15 '16 at 15:18
  • 3
    @WiktorStribiżew You don't need to remove it. It is a valuable answer and one can learn more about regex from it. Voting will point to the answer most well liked and future users can take votes into consideration. – Roland Nov 15 '16 at 15:23
  • 2
    No need to delete. There's a lot to learn from the answer. Not many people know about Possessive Quantifiers like `[^()]++`. *raises hand – Pierre L Nov 15 '16 at 15:30
  • Regexps cannot count - this is usually learned the hard way with working with XML and HTML, but applies here too. – Thorbjørn Ravn Andersen Nov 16 '16 at 08:33
  • @ThorbjørnRavnAndersen: **cannot count** is an ambiguous statement. A regex cannot add `1+1` to give you `2`, but a PCRE regex can match n to m chars (consecutive - `a{1,3}`, non-consecutive - `(?:[^a]*a){1,3}`), it can match a string between matching parentheses (`\((?:[^()]++|(?R))*\)`) and other pairs of chars, and it can [Match correct addition of two binary numbers](http://stackoverflow.com/questions/39553372/match-correct-addition-of-two-binary-numbers-with-pcre-regex). And more if you search SO. – Wiktor Stribiżew Nov 16 '16 at 08:37
  • @WiktorStribiżew It cannot count in the sense of maintaining a stack. If you need to match `

    ...

    ` you cannot easily specify which `

    ` a given `

    ` belongs to, making it hard to handle matches in matches correctly
    – Thorbjørn Ravn Andersen Nov 16 '16 at 08:53
  • @ThorbjørnRavnAndersen: Recursion used in PCRE regex does not really maintain a stack, but it is exactly what helps you achieve what you just said is impossible - [`

    (?:(?:(?!<\/?p>).)*+|(?R))*<\/p>`](https://regex101.com/r/enN0UD/2).

    – Wiktor Stribiżew Nov 16 '16 at 09:28
  • @WiktorStribiżew I am not familiar with the PCRE dialect but it appears it is extended quite a bit from the original (the call subprocedure may be quite a step towards Turing completeness). So if I understand your comment correctly this regexp can match an arbitrarily deep

    -in-

    pair?

    – Thorbjørn Ravn Andersen Nov 16 '16 at 10:56
  • @ThorbjørnRavnAndersen: Well, I must admit that there is a recursion depth limit that can be controlled to some extent. But it should deal with a hundred or two levels fine. – Wiktor Stribiżew Nov 16 '16 at 10:58
  • @WiktorStribiżew I just read you revised regexp (for some corner case): `v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(?|()(\\d*\\.?\\d+)|(\\((‌​(?:[^()]++|(?3))*)\\‌​)))", "pow(\\2, \\4)", v, perl=TRUE);`- I would venture this is beginning to be a rather "hard-to-maintain"-solution and a bit beyond the "Ok, we need a parser"-point. – Thorbjørn Ravn Andersen Nov 16 '16 at 12:27
  • @ThorbjørnRavnAndersen: If you check the question history, my comments, and so on, you will see it started as an easy to solve with regex question. Later users generalized the OP task, and made my answer look bad. I did not really try to provide a generic solution to parse math expressions, though I can extend my current approach and make it maintainable by using blocks. I know it is not the best approach at all for a general math expression parsing task. If it were formulated like that, and the OP regex did not contain those parentheses, I would not even bother posting a regex solution. – Wiktor Stribiżew Nov 16 '16 at 12:32
13

Here is a solution that follows the parse tree recursively and replaces ^:

#parse the expression
#alternatively you could create it with
#expression(((2-x+3)^2+(x-5+7)^10)^0.5)
e <- parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5")

#a recursive function
fun <- function(e) {    
  #check if you are at the end of the tree's branch
  if (is.name(e) || is.atomic(e)) { 
    #replace ^
    if (e == quote(`^`)) return(quote(pow))
    return(e)
  }
  #follow the tree with recursion
  for (i in seq_along(e)) e[[i]] <- fun(e[[i]])
  return(e)    
}

#deparse to get a character string    
deparse(fun(e)[[1]])
#[1] "pow((pow((2 - x + 3), 2) + pow((x - 5 + 7), 10)), 0.5)"

This would be much easier if rapply worked with expressions/calls.

Edit:

OP has asked regarding performance. It is very unlikely that performance is an issue for this task, but the regex solution is not faster.

library(microbenchmark)
microbenchmark(regex = {
  v <- "((2-x+3)^2+(x-5+7)^10)^0.5"
  x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
  while(x) {
    v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", "pow(\\2, \\3)", v, perl=TRUE);
    x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
  }
},
BrodieG = {
  deparse(f(parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5")[[1]]))
},
Roland = {
  deparse(fun(parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5"))[[1]])
})

#Unit: microseconds
#    expr     min      lq     mean  median      uq     max neval cld
#   regex 321.629 323.934 335.6261 335.329 337.634 384.623   100   c
# BrodieG 238.405 246.087 255.5927 252.105 257.227 355.943   100  b 
#  Roland 211.518 225.089 231.7061 228.802 235.204 385.904   100 a

I haven't included the solution provided by @digEmAll, because it seems obvious that a solution with that many data.frame operations will be relatively slow.

Edit2:

Here is a version that also handles sqrt.

fun <- function(e) {    
  #check if you are at the end of the tree's branch
  if (is.name(e) || is.atomic(e)) { 
    #replace ^
    if (e == quote(`^`)) return(quote(pow))
    return(e)
  }
  if (e[[1]] == quote(sqrt)) {
    #replace sqrt
    e[[1]] <- quote(pow)
    #add the second argument
    e[[3]] <- quote(0.5)
  }
  #follow the tree with recursion
  for (i in seq_along(e)) e[[i]] <- fun(e[[i]])
  return(e)    
}

e <- parse(text = "sqrt((2-x+3)^2+(x-5+7)^10)")
deparse(fun(e)[[1]])
#[1] "pow(pow((2 - x + 3), 2) + pow((x - 5 + 7), 10), 0.5)"
Roland
  • 127,288
  • 10
  • 191
  • 288
  • 1
    Beat you by two mins ;P. Agree with `rapply`, was my first attempt... I always try to find a way to use `rapply` and get frustrated. – BrodieG Nov 15 '16 at 14:43
  • 1
    I saw that but thought I'd post my solution anyway since it appears to be sufficiently different. It was a fun challenge figuring this out. :) – Roland Nov 15 '16 at 14:48
7

Here's an example exploiting R parser (using getParseData function) :

# helper function which turns getParseData result back to a text expression
recreateExpr <- function(DF,parent=0){
  elements <- DF[DF$parent == parent,]
  s <- ""
  for(i in 1:nrow(elements)){
    element <- elements[i,]
    if(element$terminal)
      s <- paste0(s,element$text)
    else
      s <- paste0(s,recreateExpr(DF,element$id))
  }
  return(s)  
}

expr <- "((2-x+3)^2+(x-5+7)^10)^0.5"

DF <- getParseData(parse(text=expr))[,c('id','parent','token','terminal','text')]

# let's find the parents of all '^' expressions
parentsOfPow <- unique(DF[DF$token == "'^'",'parent'])

# replace all the the 'x^y' expressions with 'pow(x,y)' 
for(p in parentsOfPow){
  idxs <- which(DF$parent == p)
  if(length(idxs) != 3){ stop('expression with '^' is not correct')  }

  idxtok1 <- idxs[1]
  idxtok2 <- idxs[2]
  idxtok3 <- idxs[3]

  # replace '^' token with 'pow'
  DF[idxtok2,c('token','text')] <- c('pow','pow')

  # move 'pow' token as first token in the expression
  tmp <- DF[idxtok1,]
  DF[idxtok1,] <- DF[idxtok2,]
  DF[idxtok2,] <- tmp

  # insert new terminals '(' ')' and ','
  DF <- rbind(
    DF[1:(idxtok2-1),],
    data.frame(id=max(DF$id)+1,parent=p,token=',',terminal=TRUE,text='(',
               stringsAsFactors=FALSE),
    DF[idxtok2,],
    data.frame(id=max(DF$id)+2,parent=p,token=',',terminal=TRUE,text=',',
               stringsAsFactors=FALSE),
    DF[(idxtok2+1):idxtok3,],
    data.frame(id=max(DF$id)+3,parent=p,token=')',terminal=TRUE,text=')',
               stringsAsFactors=FALSE),
    if(idxtok3<nrow(DF)) DF[(idxtok3+1):nrow(DF),] else NULL
  )
}

# print the new expression
recreateExpr(DF)

> [1] "pow((pow((2-x+3),2)+pow((x-5+7),10)),0.5)"
digEmAll
  • 56,430
  • 9
  • 115
  • 140