
I have a piece of code that splits a string at commas and dots, except when the comma or dot has a digit on both sides:

import re
text = "This is, a sample text. Some more text. $1,200 test."
print(re.split(r'(?<!\d)[,.]|[,.](?!\d)', text))

The result is:

['This is', ' a sample text', ' Some more text', ' $1,200 test', '']

I don't want to lose the commas and dots. So what I am looking for is:

['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']

Besides, if the text ends with a dot, the split produces an empty string at the end of the list. Furthermore, the split strings begin with whitespace. Is there a better method without using re? How would you do this?

Patsy Issa
Johnny

1 Answer


Unfortunately, you can't use re.split() on a zero-length match (at least before Python 3.7, which added support for splitting on empty matches), so unless you can guarantee that there will be whitespace after the comma or dot, you will need to use a different approach.
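For readers on Python 3.7 or later, where re.split() accepts patterns that can match an empty string, a zero-width split is possible after all. The following is only a minimal sketch under that version assumption; the pattern and the helper name split_keep_punct are illustrative and not part of the original answer.

```python
import re

# Zero-width split point: just after a comma or dot, unless that
# comma/dot has a digit on both sides (as in "1,200").
# Assumes Python 3.7+, where re.split() can split on empty matches.
SPLIT_POINT = re.compile(r'(?<=[,.])(?:(?<!\d[,.])|(?!\d))')

def split_keep_punct(text):
    """Hypothetical helper: split after commas/dots, keeping them."""
    parts = SPLIT_POINT.split(text)
    # Strip surrounding whitespace and drop empty pieces (e.g. the
    # trailing one produced when the text ends with a dot).
    return [p.strip() for p in parts if p.strip()]

text = "This is, a sample text. Some more text. $1,200 test."
print(split_keep_punct(text))
```

Because the separator itself is never consumed, the commas and dots stay attached to the preceding piece, and no whitespace after them is required.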

Here is one option that uses re.findall():

>>> import re
>>> text = "This is, a sample text. Some more text. $1,200 test."
>>> print(re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text))
['This is,', ' a sample text.', ' Some more text.', ' $1,200 test.', '']

This doesn't strip whitespace, and it produces an empty match at the end of the string, but both are easy to fix.
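One possible way to apply both fixes is a list comprehension that strips each piece and discards the empty ones. This is a sketch rather than the only option; the variable names pieces and cleaned are made up for illustration.

```python
import re

text = "This is, a sample text. Some more text. $1,200 test."

# Same pattern as above: a run of non-separator characters (or a digit
# followed by a comma/dot), ended by a comma, a dot, or end of string.
pieces = re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text)

# Strip leading/trailing whitespace and discard empty strings.
cleaned = [p.strip() for p in pieces if p.strip()]
print(cleaned)
```

Filtering on p.strip() rather than on p also drops pieces that consist only of whitespace.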

If it is safe to assume that whitespace follows every comma and dot you want to split on, then we can split the string on that whitespace instead, which is a little simpler:

>>> print(re.split(r'(?<=[,.])(?<!\d.)\s', text))
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']
Andrew Clark