0

I'm trying to build a regex in JavaScript that will match parts of an arithmetic operation. For instance, here are a few inputs and expected outputs:

What is 7 minus 5?          >> ['7','minus','5']
What is 6 multiplied by -3? >> ['6','multiplied by', '-3']

I have this working regex: /^What is (-?\d+) (minus|plus|multiplied by|divided by) (-?\d+)\?$/

Now I want to expand things to capture additional operations. For instance:

What is 7 minus 5 plus 3?  >> ['7','minus','5','plus','3']

So I used: ^What is (-?\d+)(?: (minus|plus|multiplied by|divided by) (-?\d+))+\?$. But it yields:

What is 7 minus 5 plus 3?  >> ['7','plus','3']

Why is the minus 5 skipped? And how do I include it in results as I'd like? (here is my sample)

BeetleJuice
  • Are you interested in just *why*? See [Java regex: Repeating capturing groups](https://stackoverflow.com/questions/6939526/java-regex-repeating-capturing-groups). – Wiktor Stribiżew Oct 04 '17 at 13:11
  • @WiktorStribiżew I would also like to know how to fix it to get the desired matches – BeetleJuice Oct 04 '17 at 13:13
  • If this is to make a parser from english to real mathematical symbols I would not use regex at all and just use a hash table containing 'multiplied by', 'minus', 'plus', 'divided by' etc. Your logic seems to be 'operand operator operand operator'. It'll make the code easier to expand and makes changing language one loop away. – Shilly Oct 04 '17 at 14:33

2 Answers

2

The problem you are facing comes from the fact that a capturing group can hold only one value. When a capturing group is repeated and matches more than once (as in your case), it keeps only the last match.

I like how it is explained at http://www.rexegg.com/regex-capture.html#spawn_groups

The capturing parentheses you see in a pattern only capture a single group. So in (\d)+, capture groups do not magically mushroom as you travel down the string. Rather, they repeatedly refer to Group 1, Group 1, Group 1… If you try this regex on 1234 (assuming your regex flavor even allows it), Group 1 will contain 4—i.e. the last capture.

In essence, Group 1 gets overwritten every time the regex iterates through the capturing parentheses.
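
To make the overwriting concrete, here is a minimal sketch that runs the quoted `(\d)+` example and the regex from the question in JavaScript. The variable names are chosen only for illustration.

```javascript
// A repeated capturing group keeps only the capture from its last iteration.
const digits = /(\d)+/.exec("1234");
console.log(digits[1]); // "4"

// The same effect with the regex from the question: groups 2 and 3 are
// overwritten on every repetition, so "minus 5" is lost.
const question = "What is 7 minus 5 plus 3?";
const repeated = /^What is (-?\d+)(?: (minus|plus|multiplied by|divided by) (-?\d+))+\?$/;
const groups = repeated.exec(question).slice(1);
console.log(groups); // [ '7', 'plus', '3' ]
```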

So the trick is to use a regex with the global flag (g) and execute the expression more than once. With the g flag, each execution starts where the previous one ended.

I've made a regex to show you the strategy: isolate the formula, then iterate until you have found everything.

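var formula = "What is 2 minus 1 minus 1";
// Step 1: isolate the arithmetic part (this sample pattern expects no trailing "?").
var regex = /^What is ((?:-?\d+)(?: (?:minus|plus|multiplied by|divided by) (?:-?\d+))+)$/;

// exec returns null when there is no match, so run it once and test the result.
var match = regex.exec(formula);
if (match) {
  var math_string = match[1];
  console.log(math_string);
  // Step 2: with the g flag, each exec call resumes where the previous one ended.
  var math_regex = /(-?\d+)? (minus|plus|multiplied by|divided by) (-?\d+)/g;
  var operation;
  var result = [];
  while ((operation = math_regex.exec(math_string)) !== null) {
    // Group 1 is set only on the first match; later matches start at the space before the operator.
    if (operation[1]) {
      result.push(operation[1]);
    }
    result.push(operation[2], operation[3]);
  }
  console.log(result);
}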

Another solution, if you don't need anything fancy, is to remove the "What is", replace "multiplied by" with "multiplied_by" (and likewise for "divided by"), and split the string on spaces.

var formula = "What is 2 multiplied by 1 divided by 1";
var regex = /^What is ((?:-?\d+)(?: (?:minus|plus|multiplied by|divided by) (?:-?\d+))+)$/;

// exec returns null when there is no match, so run it once and test the result.
var match = regex.exec(formula);
if (match) {
  // Use global regexes: a string pattern would replace only the first occurrence.
  var math_string = match[1].replace(/multiplied by/g, 'multiplied_by').replace(/divided by/g, 'divided_by');
  console.log(math_string.split(" "));
}
Salketer
  • Thanks for the creative answers, and for teaching me about successive execution. I think it would be helpful to others to link to [Mozilla docs](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec) as well. – BeetleJuice Oct 04 '17 at 18:42
1

Each capturing group in a regex can hold only a single value. So if you repeat a group, you get only one result for that group (in JavaScript, the last one). In your case, the repeated part is the following:

(?: (minus|plus|multiplied by|divided by) (-?\d+))+

You're repeating the outer non-capturing group, which matches repeatedly. But each capturing group inside it can, in the end, hold only a single match, which is the result of the last repetition.

You should probably switch to matching tokens instead of having a single regex that tries to match the whole phrase and dissects it via capturing groups. For example, use a two-step process: first verify that the whole phrase is constructed correctly (starts with »What is«, ends with »?«, etc.), then run a second pass that extracts the individual tokens with something like

-?\d+|minus|plus|multiplied by|divided by
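
The two-step process described above could be sketched roughly as follows. The function name `tokenize` and the exact validation pattern are assumptions made for illustration; only the token pattern itself comes from this answer.

```javascript
// Hypothetical helper: validate the whole phrase, then extract the tokens.
function tokenize(question) {
  const number = "-?\\d+";
  const operator = "minus|plus|multiplied by|divided by";

  // Step 1: check the overall shape (assumed: at least one operation, trailing "?").
  const shape = new RegExp(`^What is ${number}(?: (?:${operator}) ${number})+\\?$`);
  if (!shape.test(question)) {
    return null;
  }

  // Step 2: with the g flag, match returns every token in order of appearance.
  return question.match(new RegExp(`${number}|${operator}`, "g"));
}

console.log(tokenize("What is 7 minus 5 plus 3?"));   // [ '7', 'minus', '5', 'plus', '3' ]
console.log(tokenize("What is 6 multiplied by -3?")); // [ '6', 'multiplied by', '-3' ]
console.log(tokenize("What is 7 minus?"));            // null
```

Keeping validation and extraction separate means no capturing group is ever repeated, so nothing gets overwritten.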
Joey
  • I wasn't aware that there is no way to implement a `forEach` that would capture every instance of a matching group. I guess it makes sense though, since allowing this would make back-references `\n` unreliable – BeetleJuice Oct 04 '17 at 13:22