10

I have a program in which a user inputs a function, such as sin(x)+1. I'm using ast to try to determine if the string is 'safe' by whitelisting components as shown in this answer. Now I'd like to parse the string to add multiplication (*) signs between coefficients without them.

For example:

  • 3x-> 3*x
  • 4(x+5) -> 4*(x+5)
  • sin(3x)(4) -> sin(3x)*(4) (sin is already in globals, otherwise this would be s*i*n*(3x)*(4)

Are there any efficient algorithms to accomplish this? I'd prefer a pythonic solution (i.e. not complex regexes, not because they're pythonic, but just because I don't understand them as well and want a solution I can understand. Simple regexes are ok. )

I'm very open to using sympy (which looks really easy for this sort of thing) under one condition: safety. Apparently sympy uses eval under the hood. I've got pretty good safety with my current (partial) solution. If anyone has a way to make sympy safer with untrusted input, I'd welcome this too.

Community
  • 1
  • 1
Luke Taylor
  • 8,631
  • 8
  • 54
  • 92
  • Regex is quite pythonic. Can't think of anything clearer and more concise for the job. – roadrunner66 Mar 10 '16 at 23:52
  • Sure, I guess… I guess I was thinking more along the lines of `str.replace` and `index`, etc. but regex is fine I guess. – Luke Taylor Mar 10 '16 at 23:54
  • Do you actually parse those expressions (build an arithmetic tree?) – fodma1 Mar 11 '16 at 00:07
  • @fodma1 In short, I don't build an *arithmetic* tree but a Python Syntax tree. This processing would need to happen before that. – Luke Taylor Mar 11 '16 at 00:27
  • 1
    SymPy has some functions that can do this in `sympy.parsing.sympy_parser`. They're designed to go straight from strings to SymPy expressions, but I think you can also use the individual transformations directly on a tokenized string. – asmeurer Mar 11 '16 at 22:09
  • But are they safe? I read they use eval – Luke Taylor Mar 11 '16 at 22:09
  • 1
    @LukeTaylor if you use the functions that go straight to SymPy expressions, yes, those use eval. But for your application you can tokenize the input (SymPy also has a tokenizer, based on the Python one, with some extensions I believe), and use these transformers on the tokenized input. Then convert that to AST, and do your filtering. Sorry I don't have time to write this up as an answer but take a look at https://github.com/sympy/sympy/blob/e098719e3a15c36089926ff153f6448b5612b8a1/sympy/parsing/sympy_parser.py#L782 (and the other functions in that file). – asmeurer Mar 12 '16 at 01:11
  • 1
    I actually want to have some better filtering in SymPy anyway, because we don't need the full Python language, and we could remove some features that make eval unsafe and it wouldn't hurt sympify (like attribute access). – asmeurer Mar 12 '16 at 01:12
  • @asmeurer I'd love this too. Maybe I'll open an issue on GitHub – Luke Taylor Mar 12 '16 at 01:13

1 Answers1

8

A regex is easily the quickest and cleanest way to get the job done in vanilla python, and I'll even explain the regex for you, because regexes are such a powerful tool it's nice to understand.

To accomplish your goal, use the following statement:

import re
# <code goes here, set 'thefunction' variable to be the string you're parsing>
re.sub(r"((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\()", r"\1*\2", thefunction)

I know it's a bit long and complicated, but a different, simpler solution doesn't make itself immediately obvious without even more hacky stuff than what's gone into the regex here. But, this has been tested against all three of your test cases and works out precisely as you want.

As a brief explanation of what's going on here: The first parameter to re.sub is the regular expression, which matches a certain pattern. The second is the thing we're replacing it with, and the third is the actual string to replace things in. Every time our regex sees a match, it removes it and plugs in the substitution, with some special behind-the-scenes tricks.

A more in-depth analysis of the regex follows:

  • ((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\() : Matches a number or a function call, followed by a variable or parentheses.
    • ((?:\d+)|(?:[a-zA-Z]\w*\(\w+\))) : Group 1. Note: Parentheses delimit a Group, which is sort of a sub-regex. Capturing groups are indexed for future reference; groups can also be repeated with modifiers (described later). This group matches a number or a function call.
      • (?:\d+) : Non-capturing group. Any group with ?: immediately after the opening parenthesis will not assign an index to itself, but still act as a "section" of the pattern. Ex. A(?:bc)+ will match "Abcbcbcbc..." and so on, but you cannot access the "bcbcbcbc" match with an index. However, without this group, writing "Abc+" would match "Abcccccccc..."
        • \d : Matches any numerical digit once. A regex of \d all its own will match, separately, "1", "2", and "3" of "123".
        • + : Matches the previous element one or more times. In this case, the previous element is \d, any number. In the previous example, \d+ on "123" will successfully match "123" as a single element. This is vital to our regex, to make sure that multi-digit numbers are properly registered.
      • | : Pipe character, and in a regex, it effectively says or: "a|b" will match "a" OR "b". In this case, it separates "a number" and "a function call"; match a number OR a function call.
      • (?:[a-zA-Z]\w*\(\w+\)) : Matches a function call. Also a non-capturing group, like (?:\d+).
        • [a-zA-Z] : Matches the first letter of the function call. There is no modifier on this because we only need to ensure the first character is a letter; A123 is technically a valid function name.
        • \w : Matches any alphanumeric character or an underscore. After the first letter is ensured, the following characters could be letters, numbers, or underscores and still be valid as a function name.
        • * : Matches the previous element 0 or more times. While initially seeming unnecessary, the star character effectively makes an element optional. In this case, our modified element is \w, but a function doesn't technically need any more than one character; A() is a valid function name. A would be matched by [a-zA-Z], making \w unnecessary. On the other end of the spectrum, there could be any number of characters following the first letter, which is why we need this modifier.
        • \( : This is important to understand: this is not another group. The backslash here acts much like an escape character would in a normal string. In a regex, any time you preface a special character, such as parentheses, +, or * with a backslash, it uses it like a normal character. \( matches an opening parenthesis, for the actual function call part of the function.
        • \w+ : Matches a number, letter or underscore one or more times. This ensures the function actually has a parameter going into it.
        • \) : Like \(, but matches a closing parenthesis
    • ((?:[a-zA-Z]\w*)|\() : Group 2. Matches a variable, or an opening parenthesis.
      • (?:[a-zA-Z]\w*) : Matches a variable. This is the exact same as our function name matcher. However, note that this is in a non-capturing group: this is important, because of the way the OR checks. The OR immediately following this looks at this group as a whole. If this was not grouped, the "last object matched" would be \w*, which would not be sufficient for what we want. It would say: "match one letter followed by more letters OR one letter followed by a parenthesis". Putting this element in a non-capturing group allows us to control what the OR registers.
      • | : Or character. Matches (?:[a-zA-Z]\w*) or \(.
      • \( : Matches an opening parenthesis. Once we have checked if there is an opening parenthesis, we don't need to check anything beyond it for the purposes of our regex.

Now, remember our two groups, group one and group two? These are used in the substitution string, "\1*\2". The substitution string is not a true regex, but it still has certain special characters. In this case, \<number> will insert the group of that number. So our substitution string is saying: "Put group 1 in (which is either our function call or our number), then put in an asterisk (*), then put in our second group (either a variable or a parenthesis)"

I think that about sums it up!

MutantOctopus
  • 3,431
  • 4
  • 22
  • 31