tokenizing mathematical equation using regex

Question

I'm trying to split up an equation string into tokens. Ive found a good starting point '([A-Za-z]+|[0-9.]+|[&=><\|!]+|\S)'. However this has trouble with negative numbers:

turns: '5--4=sin(2+3)'
into: ['5','-','-','4','=','sin','(','2','+','3',')']
want: ['5','-','-4','=','sin','(','2','+','3',')']

and also

turns: -3+3
into: ['-','3','+','3']
want: ['-3','+','3']

It looks like a my regex could use something that checks if there is a number to the left of the '-' if not keep it with the next number(note '-3' has nothing to the left). Can it be done using regex? Or is there a better tool to split this up in .NET?

score 2 · Answer 1 · edited May 23 '17 at 12:33

You are not approaching the problem correctly. The result you actually got is the correct one.

-3+3 should parse to:

operator binary +
|
+-- operator unary -
|   |
|   +-- 3
|
+-- 3

It will be much easier to reason about math expressions this way, you'll avoid many ambiguities. Let just - always be a token on its own, and use it either as a binary minus, or an unary negation operator.

See here for a related answer of mine which approaches the problem this way (it uses ANTLR but the lexing pass does exactly what I'm advising you to do).

score 1 · Accepted Answer · answered Jan 12 '17 at 20:21

1

Regex is not powerful enough to do what you want in all contexts. Although you can make regex recognize + or - as part of an integer literal, for example, by adding an optional [+-]? in front of a digit sequence, the resultant regex would opt to tokenize '-3+3' as ['-3', '+3'] (demo).

Using a lexer generator should fix this problem; alternatively, you can deal with "bundling" unary operators with their operands in the parser.

answered Jan 12 '17 at 20:21

Sergey Kalinichenko

714,442
84
1,110
1,523

Oh come on, regex is perfectly suited for lexing - it's a Chomsky type 3 problem. OP doesn't realize that the result he got is actually *exactly* what he needs. in `-` `3`, the `-` is actually the unary negation operator. – Lucas Trzesniewski Jan 12 '17 at 20:38
@LucasTrzesniewski Of course, regex is perfectly suited for lexing, but OP wants his lexing to be context-sensitive. He wants two minuses in `-3-3` treated differently, which is neither what he wants nor what regex can deliver. – Sergey Kalinichenko Jan 12 '17 at 20:54
Yes, after reading your answer for a second time now I get what you meant. BTW a lexer generator won't just magically fix the "problem" either, most just use regex under the hood ;) – Lucas Trzesniewski Jan 12 '17 at 20:57
@LucasTrzesniewski I had ANTLR 3 in mind: I remember it producing some nice "magic" for me in terms of lexing complex expressions. I think they do it by adding an extra layer on "smartness" top of regular expressions, though. – Sergey Kalinichenko Jan 12 '17 at 21:06
Hmmm... there's no such thing in ANTLR as far as I can tell. It lets you inject code into lexer expressions though by using inline blocks `{...}` or predicates `{...}?` (I don't remember if the v3 lexer used `{...}?` or `{...}?=>` though), so you can achieve custom magic that way. – Lucas Trzesniewski Jan 12 '17 at 21:09
Your answer confused me but Lucas made me google words Ive never hear and now what your saying makes sense. Mainly the word 'unary' made me realize i was already using that with sin(x) in a way. Now as I loop through my un-typed string tokens I peek at the previous token to see if its a number otherwise consider it unary. Thanks to both Lucas and dasblinkenlight. I will have to try out Lucas's code when I get a chance. – ozz Jan 12 '17 at 22:16
Something like `((?<!^|[\-\+])[\-\+]|[\-\+]\d|[!<>]?=|\w+|.)` could work (if spaces are truncated). – Dávid Horváth Jan 05 '21 at 17:17

tokenizing mathematical equation using regex

2 Answers2