
Problem:

I have a string containing different numbers, math signs and words, e.g.

str = ".1**2 + x/(10.0 - 2.E-4)*n_elts"

I would like to extract all numbers and keep the parts between them, so I can put everything back together later (after working on the numbers).

lst = [".1", "**", "2", " + ", "x/(", "10.0", " - ", "2.E-4", ")*n_elts"]

would be one of many acceptable results. The elements which are not numbers can be split up further in any arbitrary way, since the next step will be

"".join(process(l) for l in lst)

where process could look like this (suggestions for a better way to check whether l is a number are welcome):

def process(l):
    try:
        float(l)  # raises ValueError if l is not a number
    except ValueError:
        return l  # not a number: pass it through unchanged
    else:
        return work_on_it(l)

Current state:

From this answer I figured out how to keep the delimiters, and worked my way to

lst = re.split('( |\+|\-|\*|/)', ".1**2 + x/(10.0 - 2.E-4)*n_elts")

Now I need to somehow avoid splitting the 2.E-4.

I tried to work out a regex (vi syntax, hope this is universal) that covers all numbers that could possibly appear and think

\d*\.\d*[E|e]*[|+|-]*\d*

should be ok.

One strategy would be to somehow use this pattern with Python's re module.

I also found a related answer that seems to do the number-matching part. It might be a bit more complex than I need, but mainly I do not know how to combine it with keeping the delimiters.

Kyss Tao

2 Answers


One general note: inside character classes you don't use |, because it is just treated as another character to be matched. Inside a character class, the allowed characters are simply listed one after another.
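To make the point about | concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; only re.findall from the standard library is used.

```python
import re

sample = "E|e"  # illustrative input, not from the question

# Inside [...] the pipe is an ordinary character, so it is matched too.
with_pipe = re.findall(r"[E|e]", sample)

# Listing the characters one after another matches only E and e.
without_pipe = re.findall(r"[Ee]", sample)

print(with_pipe)     # ['E', '|', 'e']
print(without_pipe)  # ['E', 'e']
```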

To actually solve your problem: since you are keeping the delimiters anyway, it doesn't matter whether you match the numbers or the non-numbers, right? So simply use

lst = re.split(r'(\d*\.\d*[Ee]*[+-]*\d*)', ".1**2 + x/(10.0 - 2.E-4)*n_elts")

You might want to improve on that number regex a bit though:

lst = re.split(r'((?:\d+\.\d*|\.?\d+)(?:[Ee][+-]?\d+)?)', ".1**2 + x/(10.0 - 2.E-4)*n_elts")

This makes the decimal point optional, but requires at least one digit before or after it. The exponent is also completely optional, but has to be well-formed if it is present. The ?: suppresses capturing. Without it, the inner groups would behave like the outer set of parentheses and add whatever they match to the result of split. You don't want that, because for each number you would get the complete number, the part before the exponent, and the exponent as separate items. Using ?: is a good habit in general unless you explicitly need capturing.
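The effect of ?: can be checked directly. The sketch below splits the string from the question twice, once with the pattern above and once with the same pattern where the inner groups capture; the variable names are only for illustration.

```python
import re

s = ".1**2 + x/(10.0 - 2.E-4)*n_elts"

# Inner groups are non-capturing: only the whole number is added per match.
non_capturing = re.split(r"((?:\d+\.\d*|\.?\d+)(?:[Ee][+-]?\d+)?)", s)

# Inner groups capture as well: every match contributes three items
# (whole number, part before the exponent, exponent or None).
capturing = re.split(r"((\d+\.\d*|\.?\d+)([Ee][+-]?\d+)?)", s)

print(non_capturing)
# ['', '.1', '**', '2', ' + x/(', '10.0', ' - ', '2.E-4', ')*n_elts']
print(capturing[:4])
# ['', '.1', '.1', None]
print(capturing[-4:])
# ['2.E-4', '2.', 'E-4', ')*n_elts']
```

Note the empty string at index 0: the input starts with a number, so the text before the first match is empty.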

Finally, note the use of raw strings (the r preceding the string literal). Without it, escaping can get really ugly, because you may have to double-escape certain regex meta-characters. In Python, you should always use raw strings to denote regex patterns.
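A small sketch of why the r prefix matters, using \b as the example (it is not part of the patterns above, it just shows the difference clearly). In a normal string literal, \b is a backspace character; in a raw string it reaches the regex engine as a word boundary.

```python
import re

s = ".1**2 + x/(10.0 - 2.E-4)*n_elts"

# Normal literal: '\b' is a single backspace character.
plain = "\bn_elts\b"
# Raw literal: backslash and 'b' are kept, so re sees a word boundary.
raw = r"\bn_elts\b"

print(len(plain), len(raw))   # 8 10
print(re.findall(plain, s))   # []
print(re.findall(raw, s))     # ['n_elts']
```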

Martin Ender

You can exploit the fact that re.split() with a capturing group returns the matches at odd indexes. For example:

import re

s = ".1**2 + x/(10.0 - 2.E-4)*n_elts"
parts = re.split(r"([+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)", s)
parts[1::2] = [str(100 * float(f)) for f in parts[1::2]]
print("".join(parts))
# -> 10.0**200.0 + x/(1000.0 - 0.02)*n_elts

The regex is taken from the question "Python and regex question, extract float/double value".
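As a sketch of how this could be wrapped for reuse: the function names transform and tag below are hypothetical, and tag merely stands in for whatever work_on_it does in the question. It also shows that the numbers stay at odd indexes whether or not the string starts with a number, and that a leading sign is taken as part of the number.

```python
import re

# Same pattern as above, compiled once.
NUMBER = re.compile(r"([+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)")

def tag(text):
    """Hypothetical stand-in for work_on_it: mark each number visibly."""
    return "<" + text + ">"

def transform(expr, func=tag):
    """Apply func to every number in expr and leave the rest untouched."""
    parts = NUMBER.split(expr)
    parts[1::2] = [func(p) for p in parts[1::2]]  # odd indexes are the matches
    return "".join(parts)

s = ".1**2 + x/(10.0 - 2.E-4)*n_elts"
print(transform(s))            # <.1>**<2> + x/(<10.0> - <2.E-4>)*n_elts
print(NUMBER.split(s)[:2])     # ['', '.1']   empty string before the first match
print(NUMBER.split("A" + s)[:2])  # ['A', '.1']
print(NUMBER.split("x-1"))     # ['x', '-1', '']  the sign joins the number
```

If the sign should stay outside the number, dropping the leading [+-]? from the pattern is one option.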

jfs
  • @ChrisF: no assumptions. From the docs: *separator components are always found at the same relative indices within the result list*. [Try it](http://ideone.com/16ytPI). – jfs May 06 '13 at 03:34
  • The odd even trick is nice. It seems unary operators become part of the first number anyways but even `'A'+s` works (which is important for me) – Kyss Tao May 06 '13 at 13:12