I have strings containing whitespace and punctuation characters that I want to decompose into individual characters. However, I want to avoid breaking up certain substrings that have a specific meaning, such as '...', ' - ', or '\r\n'.
For example, the string '\n\r\n... .. - -..' should be decomposed into
['\n', '\r\n', '...', '.', '.', ' - ', '-', '.', '.']
I've tried using re.findall() and re.split(), but without success. The following has produced the best results so far (I included only some of the delimiters here to keep the example short):
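import re

# Longer delimiters are listed before their shorter prefixes.
delimiters = [
    r'\r\n',
    r'\n',
    r' - ',
    r' ',
    r'...',
    r'\.',
]
# The capturing group makes split() return the delimiters as well.
pattern = re.compile(r'(' + r'|'.join(delimiters) + r')')
match = pattern.split('\n\r\n... .. - -..')
print([m for m in match if m])  # drop the empty strings between matches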
which gives
['\n', '\r\n', '...', ' ', '.. ', '- -', '.', '.']
This is very close, but I don't understand why '.. ' and '- -' are matched as single tokens.
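The r'...' entry in the delimiter list is not escaped, so the regex engine reads it as "any three characters except a newline", which is why '.. ' and '- -' each come out as one token. Below is a minimal sketch of one possible fix, assuming the literal substrings are passed through re.escape() and every remaining character should become its own token. The `tokenize` helper and the shortened delimiter list are illustrative names, not part of the original code.

```python
import re

# Multi-character substrings that must stay intact (illustrative subset).
delimiters = ['\r\n', ' - ', '...']

# re.escape() makes '...' match three literal dots instead of
# "any three characters".
keep = '|'.join(re.escape(d) for d in delimiters)

# Try the kept substrings first, then fall back to any single character.
# re.DOTALL lets the fallback '.' also match a lone '\n'.
pattern = re.compile(keep + '|.', re.DOTALL)


def tokenize(text):
    """Return the kept substrings plus every other character on its own."""
    return pattern.findall(text)


print(tokenize('\n\r\n... .. - -..'))
```

With this pattern the single space after '...' is also returned as its own token, because every character that is not part of a kept substring is emitted individually. If more delimiters are added and one is a prefix of another, the longer one should come first in the list, since alternation takes the first branch that matches.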