C++: Extracting symbols/variables of an analytical mathematical expression

Question

I have expressions that can be provided by the user, such as:

 a*sin(w*t) 
 a+b/c
 x^2+y^2/2

I would like to get just the list of variables in each expression; I don't need to do any substitutions. For the first formula the result would be {a,w,t}, for the second {a,b,c}, and for the last {x,y}.

The expressions are primarily written to be parsed with SymPy, but I need to be able to get the list of variables in C++ for some checks. I would like to:

Avoid having to link the whole Python interpreter to my program
Avoid reinventing the wheel, as I saw there are many parsing libraries available, such as muparser, but I don't know if any of these provide this functionality

What's the easiest way to do this? How would you tackle this problem?

Can you provide a grammar or some description of the expression format? From the 3 examples it looks like it would be enough to split the string using any non-alphabetic character as a delimiter, remove duplicates, and remove known names like "sin". But I guess this would break on some more complicated expressions. – michalsrb, Dec 06 '16 at 14:48
Are your user expressions allowed to multiply variables without the use of an asterisk (for example, given `a` and `b`, is either of these valid: `ab` or `a(b)`)? Also, will your user expressions contain variables of more than 1 letter (for example, is `xy` a valid variable name)? – Jonathan Mee, Dec 06 '16 at 14:51
@michalsrb Perhaps that splitting idea is a good start. Thanks for the idea, I'll think about it. – The Quantum Physicist, Dec 06 '16 at 14:55
@JonathanMee Variables of more than 1 letter can exist. Multiplication without an asterisk is not allowed. – The Quantum Physicist, Dec 06 '16 at 14:56

Jonathan Mee · Accepted Answer · 2017-09-29T22:53:21.983

Given the input const string input, we can collect our variables into a set<string> with a regex:

\b([a-zA-Z]\w*)(?:[^(a-zA-Z0-9_]|$)

You could use this in C++ as follows:

const regex re{ "\\b([a-zA-Z]\\w*)(?:[^(a-zA-Z0-9_]|$)" };
const set<string> output{ sregex_token_iterator(cbegin(input), cend(input), re, 1), sregex_token_iterator() };

Live Example
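For readers who want something they can compile directly, here is a minimal sketch that wraps the two lines above in a function. The name extract_variables is hypothetical (it is not part of the original answer), and the snippet assumes a C++11 or later compiler with a working <regex> implementation.

```cpp
#include <regex>
#include <set>
#include <string>

// Hypothetical helper name; the regex is the one given in the answer.
std::set<std::string> extract_variables(const std::string& input) {
    static const std::regex re{ "\\b([a-zA-Z]\\w*)(?:[^(a-zA-Z0-9_]|$)" };
    // Submatch index 1 selects the captured name only, leaving out the
    // delimiter character that the non-capturing group consumes.
    return std::set<std::string>(
        std::sregex_token_iterator(input.cbegin(), input.cend(), re, 1),
        std::sregex_token_iterator());
}
```

With the three expressions from the question, extract_variables("a*sin(w*t)") should give {a, t, w}, extract_variables("a+b/c") should give {a, b, c}, and extract_variables("x^2+y^2/2") should give {x, y}. Using a set removes duplicates and sorts the names.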

EDIT:

regex explanation:

\b asserts a word boundary: a position between a \w and a \W character, or at the beginning or end of the string next to a \w character
([a-zA-Z] captures a name beginning with an alphabetic character
\w*) followed by any number of "word" characters
(?: starts a non-capturing group with two alternatives
[^(a-zA-Z0-9_] the 1st alternative is any character that is neither an open parenthesis nor a "word" character
|$) the other alternative is that the end of the input has been reached
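To see how these pieces interact, the following sketch lists each full match next to its captured name. The helper trace_matches is a hypothetical name introduced only for illustration. It uses std::sregex_iterator so that both the whole match (which includes the consumed delimiter) and capture group 1 are visible.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (full match, captured name) pairs in input order.
std::vector<std::pair<std::string, std::string>>
trace_matches(const std::string& input) {
    static const std::regex re{ "\\b([a-zA-Z]\\w*)(?:[^(a-zA-Z0-9_]|$)" };
    std::vector<std::pair<std::string, std::string>> result;
    for (std::sregex_iterator it(input.cbegin(), input.cend(), re), end;
         it != end; ++it) {
        // str(0) is the whole match, str(1) is the captured identifier.
        result.emplace_back(it->str(0), it->str(1));
    }
    return result;
}
```

For "a*sin(w*t)" this should produce the pairs ("a*", "a"), ("w*", "w") and ("t)", "t"). The name sin is skipped because it is directly followed by an open parenthesis. An input such as "1e-5" should produce no matches, since there is no word boundary between 1 and e.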

edited Sep 29 '17 at 22:53

answered Dec 06 '16 at 15:11

Jonathan Mee


Thanks. I'll come back to this soon and test it :) – The Quantum Physicist Dec 06 '16 at 15:53
@TheQuantumPhysicist I've added some explanation to the answer. You may find the http://regex101.com and http://ideone.com links in the answer useful for testing. – Jonathan Mee Dec 06 '16 at 15:57
Just noticed that this doesn't support numbers after the character (and other peaceful symbols, like underscores). I'll modify your answer later with that (or you're welcome to do that if you have time). – The Quantum Physicist Dec 06 '16 at 16:01
@TheQuantumPhysicist Yeah, to match non-leading `\w` characters, we'll need to change `[a-zA-Z]+` to `[a-zA-Z]\w*` and `[^(a-zA-Z]` to `[^(a-zA-Z0-9_]`. I've done so and updated the answer. – Jonathan Mee Dec 06 '16 at 16:56
Thanks a lot! You know, maybe this is a little off topic, but there's one last part of the regex I wasn't able to implement. I want only a single underscore "_" to be possible. So everything should remain the same, but a string with "__" (so 2 underscores) should fail. – The Quantum Physicist Dec 06 '16 at 16:59
@TheQuantumPhysicist It will. I've updated the explanation section of the answer. The only things that will be captured are words *beginning with an alphabetic character.* – Jonathan Mee Dec 06 '16 at 17:02
Actually I meant that underscores can happen only once in a row. So `var_` is OK, `va_r_` is OK, but `var__` is not OK. – The Quantum Physicist Dec 06 '16 at 17:32
@TheQuantumPhysicist That can be done, but it will add significant complexity. I'd suggest accepting this and opening a new question where we solve that problem. – Jonathan Mee Dec 06 '16 at 17:49
I just discovered that this doesn't work for expressions that have numbers in the c-form, like 1e-5. Do you have a suggestion there? – The Quantum Physicist Dec 22 '16 at 18:18
@TheQuantumPhysicist This is going to be a hard conversation in the comments. I'd suggest opening a new question and putting a link here in the comments, and I'll come find it. In your question please define what you mean by "doesn't work": this regex won't match "1e-5" as a variable, which I view as desired behavior. In your question put an input and expected output; I'll look it over and see if we can adjust to correct matching. – Jonathan Mee Dec 22 '16 at 19:54
