3

I would like to split a string into sections of numbers and sections of text/symbols. My current code doesn't handle negative numbers or decimals, and it behaves oddly, adding an empty string to the end of the output.

import re
mystring = 'AD%5(6ag 0.33--9.5'
newlist = re.split('([0-9]+)', mystring)
print (newlist)

current output:

['AD%', '5', '(', '6', 'ag ', '0', '.', '33', '--', '9', '.', '5', '']

desired output:

['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5']
ragardner
  • 1
    The pattern `'(-?[0-9\.]+)'` gives you your desired output but will also have a couple of empty strings – ryugie Apr 05 '17 at 17:13
  • @ryugie Thank you! Any idea why it is adding an empty string? – ragardner Apr 05 '17 at 17:15
  • 1
    Try `re.split(r'(-?\d*\.?\d+)', s)`, and to get rid of empty values use `filter(None, result)`. – Wiktor Stribiżew Apr 05 '17 at 17:19
  • 1
    @new_to_coding - It adds an empty string because you're splitting on digits, i.e. using digits as a delimiter, so the empty string is what's between the delimiters. The numbers show up on your list only because you wrapped your pattern in parentheses, so you're capturing the delimiters as well. – ryugie Apr 05 '17 at 17:23
  • @ryugie very interesting, thank you – ragardner Apr 05 '17 at 17:25
  • @WiktorStribiżew thank you for your response, your expression also seems to work. If either one of you would like to submit either your answer or both (with a short explanation of the others? if you want) with credit given I will accept it – ragardner Apr 05 '17 at 17:25

3 Answers

4

Your regex matches one or more digits, and because the pattern is wrapped in a capturing group, each match is also added to the resulting list. The digits act as the delimiter, so re.split() returns the text before and after every match. If the string ends with digits, the text after the last match is the empty string, and that empty string is appended to the end of the list.
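As a small illustration of that behaviour (the sample strings here are made up for the demonstration and are not from the question), the following compares a split with and without a capturing group, and with digits at the end, at the start, and in the middle of the string:

```python
import re

# Without a capturing group the delimiter is dropped, but the empty
# string after the trailing digits is still returned.
no_group = re.split(r'\d+', 'ab12')          # ['ab', '']

# With a capturing group the delimiter is kept in the list as well.
trailing = re.split(r'(\d+)', 'ab12')        # ['ab', '12', '']
leading = re.split(r'(\d+)', '12ab')         # ['', '12', 'ab']
middle = re.split(r'(\d+)', 'ab12cd')        # ['ab', '12', 'cd']

for result in (no_group, trailing, leading, middle):
    print(result)
```

An empty string only shows up when a match touches the start or end of the input (or when two matches are adjacent), which is why it appears for the string in the question.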

You may split with a regex that matches float or integer numbers with an optional minus sign and then remove empty values:

import re
s = 'AD%5(6ag 0.33--9.5'
result = re.split(r'(-?\d*\.?\d+)', s)
result = list(filter(None, result))  # drop empty strings; list() is needed on Python 3

To match negative/positive numbers with exponents, use

r'([+-]?\d*\.?\d+(?:[eE][-+]?\d+)?)'

The -?\d*\.?\d+ regex matches:

  • -? - an optional minus
  • \d* - 0+ digits
  • \.? - an optional literal dot
  • \d+ - one or more digits.
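Putting the pieces together, here is a minimal sketch that wraps the split and the empty-string filter in one helper. The names `split_numbers`, `SIMPLE` and `WITH_EXPONENT` are hypothetical and chosen only for this example; the two patterns are the ones given above.

```python
import re

# Patterns taken from the answer; the capturing group keeps the numbers in the result.
SIMPLE = r'(-?\d*\.?\d+)'
WITH_EXPONENT = r'([+-]?\d*\.?\d+(?:[eE][-+]?\d+)?)'

def split_numbers(text, pattern=SIMPLE):
    """Split text into number and non-number chunks, dropping empty strings."""
    return [part for part in re.split(pattern, text) if part]

print(split_numbers('AD%5(6ag 0.33--9.5'))
# ['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5']

print(split_numbers('x=1e-3y', WITH_EXPONENT))
# ['x=', '1e-3', 'y']
```

The list comprehension does the same job as `filter(None, result)` but returns a list on both Python 2 and Python 3.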
Wiktor Stribiżew
2

Unfortunately, re.split() does not offer an "ignore empty strings" option. However, to retrieve your numbers, you could easily use re.findall() with a different pattern:

import re

string = "AD%5(6ag0.33-9.5"
rx = re.compile(r'-?\d+(?:\.\d+)?')
numbers = rx.findall(string)

print(numbers)
# ['5', '6', '0.33', '-9.5']
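If the text between the numbers is needed as well, one possible variation (a sketch, not part of the original answer) is to let re.findall() match either a number or a run of characters that does not start a number. The names `NUMBER`, `TOKEN` and `tokenize` are hypothetical, and the negative-lookahead alternative is an assumption about how to express "anything that is not a number":

```python
import re

# Same number pattern as above.
NUMBER = r'-?\d+(?:\.\d+)?'

# Either a number, or one or more characters consumed one at a time
# as long as no number starts at the current position.
TOKEN = re.compile(rf'{NUMBER}|(?:(?!{NUMBER}).)+', re.DOTALL)

def tokenize(text):
    """Return numbers and the text between them, in order, with no empty strings."""
    return TOKEN.findall(text)

print(tokenize('AD%5(6ag 0.33--9.5'))
# ['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5']
```

Because findall() only returns actual matches, no empty strings need to be filtered out afterwards.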
Jan
  • that's amazing, thank you, but not quite what I needed to do, but very useful for extracting numbers nonetheless – ragardner Apr 05 '17 at 17:27
1

As mentioned in the other answers, re.split() has no option to ignore empty strings, but you can easily build a new list without them:

import re

mystring = "AD%5(6ag0.33--9.5"
newlist = [x for x in re.split(r'(-?\d+\.?\d*)', mystring) if x != '']
print(newlist)

output:

['AD%', '5', '(', '6', 'ag', '0.33', '-', '-9.5']