I have strings containing whitespace and punctuation characters that I want to split into individual characters. However, certain substrings that have a specific meaning, such as ..., " - " (a hyphen surrounded by spaces), or \r\n, should be kept intact.

For example, the string \n\r\n... .. - -.. should be decomposed into

['\n', '\r\n', '...', ' ', '.', '.', ' - ', '-', '.', '.']

I've tried re.findall() and re.split(), but without success. The following has produced the best results so far (I have included only some of the delimiters to keep the example short):

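import re

# Longer alternatives are listed before the shorter ones they contain
delimiters = [
    r'\r\n',
    r'\n',
    r' - ',
    r' ',
    r'...',
    r'\.',
]
# The capture group makes split() return the delimiters as well
pattern = re.compile(r'(' + r'|'.join(delimiters) + r')')
match = pattern.split('\n\r\n... .. - -..')
# Drop the empty strings that split() puts between adjacent delimiters
print([m for m in match if m])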

which gives

['\n', '\r\n', '...', ' ', '.. ', '- -', '.', '.']

This is very close, but I don't understand why '.. ' and '- -' are each matched as a single token.
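A likely explanation, assuming r'...' is meant to match three literal dots: in a regular expression an unescaped . matches any character except a newline, so r'...' matches any three such characters, including '.. ' and '- -'. The sketch below changes only that one delimiter to r'\.\.\.'. The function name tokenize is made up for illustration.

```python
import re

# Same delimiters as in the question, with the three dots escaped
delimiters = [
    r'\r\n',
    r'\n',
    r' - ',
    r' ',
    r'\.\.\.',  # three literal dots instead of "any three characters"
    r'\.',
]
pattern = re.compile(r'(' + r'|'.join(delimiters) + r')')


def tokenize(text):
    """Split text on the delimiters, keeping them and dropping empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


print(tokenize('\n\r\n... .. - -..'))
# ['\n', '\r\n', '...', ' ', '.', '.', ' - ', '-', '.', '.']
```

Each single space comes out as its own token, because r' ' is one of the delimiters. If the delimiters are kept as plain strings rather than patterns, re.escape() can do the escaping.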

Konstantin