
I've been struggling with this for quite a while (not being a regex ninja), searching Stack Overflow and working through trial and error. I think I'm close, but there are still a few hiccups that I need help sorting out.

The requirement is that a given equation, which may include variables, exponents, etc., is split by the regex pattern into its tokens: variables, constants, values, operators, and so on. What I have so far:

     Regex re = new Regex(@"(\,|\(|\)|(-?\d*\.?\d+e[+-]?\d+)|\+|\-|\*|\^)");
     var tokens = re.Split(equation);

So an equation such as

    2.75423E-19* (var1-5)^(1.17)* (var2)^(1.86)* (var3)^(3.56)

should parse to

     [2.75423E-19 ,*, (, var1,-,5, ), ^,(,1.17,),*....,3.56,)]

However, the exponent portion (E-19) is getting split as well, which I think is due to this part of the regex: |\+|\-.

Other renditions I've tried are:

    Regex re1 = new Regex(@"([\,\+\-\*\(\)\^\/\ ])");

and

    Regex re = new Regex(@"(-?\d*\.?\d+e[+-]?\d+)|([\,\+\-\*\(\)\^\/\ ])");

both of which have their flaws. Any help would be appreciated.

J newson
  • How do you plan to disambiguate between a minus sign on a negative value and minus as an arithmetic operator? Or is it not necessary here? BTW, check [`[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?|[-^+*/()]|\w+`](http://regexstorm.net/tester?p=%5b0-9%5d*%5c.%3f%5b0-9%5d%2b(%5beE%5d%5b-%2b%5d%3f%5b0-9%5d%2b)%3f%7c%5b-%5e%2b*%2f()%5d%7c%5cw%2b&i=2.75423E-19*+(var1-5)%5e(1.17)*+(var2)%5e(1.86)*+(var3)%5e(3.56)) that *matches* the tokens. – Wiktor Stribiżew Jan 13 '16 at 22:33
  • IMHO I believe you'd be better off looking at a proper parsing mechanism. – Lee Taylor Jan 13 '16 at 22:36
  • @stribizhev you should post that as an answer, since it properly tokenizes the text. BTW in arithmetic parsing you don't usually deal with negative number tokens, but treat them as a positive number with a unary minus operator. And to OP, if you need to write a custom parser, you may be interested in [this answer] of mine, or maybe you could use something like [NCalc](https://ncalc.codeplex.com/) if it fits your needs. – Lucas Trzesniewski Jan 14 '16 at 09:03
  • @LucasTrzesniewski: I am actually not sure if I should, but since you think I should, I have :) – Wiktor Stribiżew Jan 14 '16 at 09:32
  • @stribizhev I have an additional method that adjusts for unary operators. – J newson Jan 14 '16 at 13:55
  • @LucasTrzesniewski you mentioned that I may be interested in [this answer], is there a missing hyperlink? Thanks again for all of your help. – J newson Jan 14 '16 at 13:55
  • @Jnewson oops, yes I forgot to paste the link. [Here it is](http://stackoverflow.com/a/29996191/3764814) - it involves a custom ANTLR parser. – Lucas Trzesniewski Jan 14 '16 at 13:58

2 Answers

4

For the equations like the one posted in the original question, you can use

[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?|[-^+*/()]|\w+

See regex demo

The regex matches:

  • [0-9]*\.?[0-9]+([eE][-+]?[0-9]+)? - a number (integer or decimal) with an optional exponent
  • | - or...
  • [-^+*/()] - any of the arithmetic operators or parentheses present in the equation posted
  • | - or...
  • \w+ - 1 or more word characters (letters, digits or underscore).
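
The `[eE]` in the number branch matters for input like `2.75423E-19`. As a rough sketch (assuming C# 9+ top-level statements; `tokenize`, `lowerOnly` and `eitherCase` are names made up for this illustration), compare a case-sensitive number branch that accepts only a lowercase `e` with one that accepts both cases:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "2.75423E-19*(var1-5)";

// Hypothetical helper for this sketch: return every regex match as a token string.
Func<string, string, string[]> tokenize = (text, pattern) =>
    Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

// Number branch accepts only a lowercase 'e': the uppercase exponent falls apart.
var lowerOnly = tokenize(input, @"[0-9]*\.?[0-9]+(?:e[-+]?[0-9]+)?|[-^+*/()]|\w+");

// Number branch accepts 'e' or 'E': the exponent stays inside the number token.
var eitherCase = tokenize(input, @"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?|[-^+*/()]|\w+");

Console.WriteLine(string.Join(" | ", lowerOnly));
Console.WriteLine(string.Join(" | ", eitherCase));
```

With the lowercase-only pattern the number comes out as the four tokens `2.75423`, `E`, `-`, `19`; with `[eE]` it stays a single `2.75423E-19` token. Passing `RegexOptions.IgnoreCase` would be another way to get the same effect.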

For more complex tokenization, consider using NCalc, as suggested in Lucas Trzesniewski's comment.

C# sample code:

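using System;
using System.Text.RegularExpressions;

var line = "2.75423E-19* (var1-5)^(1.17)* (var2)^(1.86)* (var3)^(3.56)";
// Each match is one token: a number, an operator or parenthesis, or an identifier.
var matches = Regex.Matches(line, @"[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?|[-^+*/()]|\w+");
foreach (Match m in matches)
    Console.WriteLine(m.Value);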

And updated code for you to show that Regex.Split is not necessary here:

var result = Regex.Matches(line, @"\d+(?:[,.]\d+)*(?:e[-+]?\d+)?|[-^+*/()]|\w+", RegexOptions.IgnoreCase)
             .Cast<Match>()
             .Select(p => p.Value)
             .ToList();

Also, to match formatted numbers, you can use \d+(?:[,.]\d+)* rather than [0-9]*\.?[0-9]+ or \d+(,\d+)*.
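
If you do want to stay with Regex.Split, keep in mind that it also emits the text of every capturing group that took part in a match. A minimal sketch of the difference (assuming C# 9+ top-level statements; `expr`, `capturing` and `nonCapturing` are illustrative names, and the input is shortened from the original equation):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var expr = "2.75423E-19*var1";

// Inner exponent group is capturing: Split emits its text as an extra element.
var capturing = Regex.Split(expr, @"([0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?|[-^+*/()]|\w+)")
                     .Where(t => t != "")
                     .ToArray();

// Inner exponent group is non-capturing: only the whole tokens are emitted.
var nonCapturing = Regex.Split(expr, @"([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?|[-^+*/()]|\w+)")
                        .Where(t => t != "")
                        .ToArray();

Console.WriteLine(string.Join(" | ", capturing));
Console.WriteLine(string.Join(" | ", nonCapturing));
```

The first array contains a stray `E-19` element right after `2.75423E-19`; the second contains only `2.75423E-19`, `*` and `var1`. Regex.Matches avoids the issue entirely because it returns whole matches only.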

Wiktor Stribiżew
  • Thanks for your solution, however I am still not quite sure it is correct. While the regex demo indicates that all the correct elements were matched, when I implemented it and split the expression I'm getting an extra E-19 element in my array. Maybe it's my misunderstanding of the regex library. I suppose I could loop over the match collection, however this could lead to unforeseen issues when implementing other equations. – J newson Jan 14 '16 at 13:49
  • @Jnewson did you use the verbatim string syntax to pass the pattern? `@"[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?|[-^+*/()]|\w+"` - like this? – Lucas Trzesniewski Jan 14 '16 at 14:00
  • @Lucas Trzesniewski I had to add it as a group like this @"([0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?|[-^+*/,()]|\w+)" in order for it to get an output, otherwise I got an empty array when I performed the split. – J newson Jan 14 '16 at 15:11
  • The regex I provided is not for `Regex.Split`, it is for `Regex.Matches`. When you enclose a part of a pattern with `(....)` this submatch is output in the resulting array during `Regex.Split`. See the demo I provided. If you are to use `Regex.Split`, use `@"([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?|[-^+*/,()]|\w+)"` – Wiktor Stribiżew Jan 14 '16 at 15:13
  • Thanks @stribizhev, I would upvote this because it led me to a solution, but unfortunately I don't have enough reputation. – J newson Jan 15 '16 at 13:04
  • You can *accept* my solution since it is the one that helped you the most, and add yours to the question. No need to post it as an answer. Your `\d+(,\d+)*(?:.\d+)?` is not the best for matching formatted numbers. You can use `\d+(?:[,.]\d+)*` instead. And there is no need to split! You can use `var result = Regex.Matches(line, @"\d+(?:[,.]\d+)*(?:[eE][-+]?[0-9]+)?|[-^+*/()]|\w+").Cast<Match>().Select(p => p.Value).ToList();` – Wiktor Stribiżew Jan 15 '16 at 13:07
-1

I think I've got a solution, thanks to @stribizhev, whose answer led me to this regex:

            // Dot escaped and inner groups made non-capturing so Split emits whole tokens only.
            Regex re = new Regex(@"(\d+(?:,\d+)*(?:\.\d+)?(?:[eE][-+]?[0-9]+)?|[-^+*/()]|\w+)");
            tokenList = re.Split(InfixExpression).Select(t => t.Trim()).Where(t => t != "").ToList();

Splitting with this pattern gives me the desired array.
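
One thing worth checking in a pattern of this kind is that the dot in the decimal part is escaped. The following is only a sketch with trimmed-down patterns (not the full pattern above), using Regex.Matches for brevity and assuming C# 9+ top-level statements; `expr`, `unescaped` and `escaped` are illustrative names:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var expr = "2-5";

// Unescaped dot: '.' matches any character, so "-5" is absorbed as a "decimal part".
var unescaped = Regex.Matches(expr, @"\d+(?:.\d+)?|[-^+*/()]|\w+")
                     .Cast<Match>().Select(m => m.Value).ToArray();

// Escaped dot: only a literal '.' can start the decimal part.
var escaped = Regex.Matches(expr, @"\d+(?:\.\d+)?|[-^+*/()]|\w+")
                   .Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(" | ", unescaped));
Console.WriteLine(string.Join(" | ", escaped));
```

With the unescaped dot, `2-5` comes back as a single token; with `\.` it is tokenized as `2`, `-`, `5`.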

J newson