
Given the code below, coming from the accepted answer of this question:

import re
pathD = "M30,50.1c0,0,25,100,42,75s10.3-63.2,36.1-44.5s33.5,48.9,33.5,48.9l24.5-26.3"
print(re.findall(r'[A-Za-z]|-?\d+\.\d+|\d+', pathD))
# ['M', '30', '50.1', 'c', '0', '0', '25', '100', '42', '75', 's', '10.3', '-63.2', '36.1', '-44.5', 's', '33.5', '48.9', '33.5', '48.9', 'l', '24.5', '-26.3']

If I include symbols such as '$' or '£' in the pathD variable, the regular expression skips them, because it only matches [A-Za-z] and digits:

[A-Za-z] # a single letter
|
-?\d+\.\d+ # floating point numbers
|
\d+ # integers
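
To make the problem concrete, here is a small sketch that runs the original pattern on '$100.0thousand'. The names `original_pattern` and `tokens` are only for illustration. It shows two things: the '$' is dropped, and the letters come back one at a time because `[A-Za-z]` has no quantifier.

```python
import re

# The pattern from the question, unchanged.
original_pattern = r'[A-Za-z]|-?\d+\.\d+|\d+'

# '$' matches none of the three alternatives, so findall silently skips it.
# [A-Za-z] matches exactly one letter, so 'thousand' is split into letters.
tokens = re.findall(original_pattern, '$100.0thousand')
print(tokens)
# ['100.0', 't', 'h', 'o', 'u', 's', 'a', 'n', 'd']
```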

How do I modify the regex pattern above so that it also keeps non-alphanumeric symbols, as in the desired output below?

new_pathD = '$100.0thousand'

new_re_expression = ???

print(re.findall(new_re_expression, new_pathD))

['$', '100.0', 'thousand']

~~~

Related SO posts are listed below, though none of them shows how to keep the symbols when splitting:

Split string into letters and numbers

split character data into numbers and letters

Python regular expression split string into numbers and text/symbols

Python - Splitting numbers and letters into sub-strings with regular expression

Pedro Rodrigues
Pythonic

1 Answer


Try this:

import re

# Letters in runs, then decimals, then integers, then any single non-word character
compiled = re.compile(r'[A-Za-z]+|-?\d+\.\d+|\d+|\W')
compiled.findall("$100.0thousand")
# ['$', '100.0', 'thousand']

Here's an Advanced Edition™

advanced_edition = re.compile(r'[A-Za-z]+|-?\d+(?:\.\d+)?|(?:[^\w-]+|-(?!\d))+')

The difference is:

compiled.findall("$$$-100thousand")  # ['$', '$', '$', '-', '100', 'thousand']
advanced_edition.findall("$$$-100thousand")  # ['$$$', '-100', 'thousand']
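
For readability, the same advanced pattern can be spelled out with `re.VERBOSE`, which ignores whitespace in the pattern and allows inline comments. This is only a sketch of an equivalent layout; the name `advanced_verbose` is made up for this example and the pattern itself is unchanged.

```python
import re

# Same alternatives as advanced_edition, one per line.
advanced_verbose = re.compile(r"""
    [A-Za-z]+            # a run of letters
  | -?\d+(?:\.\d+)?      # an integer or decimal with an optional leading minus
  | (?: [^\w-]+          # a run of non-word characters other than a hyphen
      | -(?!\d)          # or a hyphen that is not directly followed by a digit
    )+
""", re.VERBOSE)

print(advanced_verbose.findall("$$$-100thousand"))  # ['$$$', '-100', 'thousand']
print(advanced_verbose.findall("$100.0thousand"))   # ['$', '100.0', 'thousand']
```

The negative lookahead `-(?!\d)` is what keeps a minus sign attached to the number that follows it instead of being swallowed by the symbol run.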
iBug
  • Bingo - accepted. For my understanding, what does `\W` target exactly? (Need to wait 9 min to accept actually) – Pythonic Dec 29 '18 at 12:01
  • @Pythonic Anything that is not a "word character", roughly equivalent to `[^A-Za-z0-9_]` (note the underscore). – iBug Dec 29 '18 at 12:02
  • Perfect, `advanced_edition` does it, for the point you highlighted that signs are implicitly meant to stick together with numbers in the given use case – Pythonic Dec 29 '18 at 12:12
  • @PedroRodrigues I approved your edit suggestion and then reverted it. Your suggestion is valid and sensible, but I don't like it, which is why I didn't reject it outright. Thank you anyway. – iBug Dec 29 '18 at 12:18
  • Just plain simple case to not use minified regexes. But go ahead do your crazy stuff. I'm out. – Pedro Rodrigues Dec 29 '18 at 12:19
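
A quick sketch of the `\W` point from the comments above, using test strings invented for this example. Because the underscore counts as a word character, `\W` does not match it, and the simple pattern from the answer drops it entirely.

```python
import re

# \W matches '$', '£', the space and '-', but not the underscore.
non_word = re.findall(r'\W', '$_£ -')
print(non_word)  # ['$', '£', ' ', '-']

# The underscore is not a letter, not a digit and not \W,
# so none of the alternatives in the simple pattern picks it up.
simple = re.compile(r'[A-Za-z]+|-?\d+\.\d+|\d+|\W')
with_underscore = simple.findall('a_b')
print(with_underscore)  # ['a', 'b']
```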