'person alpha |person beta |\*|\n'
There are multiple things wrong with this attempt. First off: the pattern passed to re.split is supposed to match the delimiters, not the items. That is, it should match the parts of the text that will not appear inside any of the items in the result list, but instead sit in between them:
>>> re.split('delimiter', 'foodelimiterbardelimiterbaz')
['foo', 'bar', 'baz']
Aside from that, person alpha and person beta in the text are consistently followed by a colon, not a space, so those alternatives never match. \* would match a literal asterisk (since it's escaped), but there's no apparent reason to look for that. \n is a literal newline in this regex: because the pattern isn't a raw string, Python itself converts the escape sequence to a newline character before the regex engine sees it. As it happens, Python's regex engine accepts a literal newline in the pattern and treats it the same as a backslash-n escape sequence, but it's important to understand in general that there are two layers of escaping going on here.
Anyway, the point is: this regex matches a single newline in the input, and also allows some other possibilities that never come up. Then, re.split returns a list of the things between those matches, i.e., the lines of the input.
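To see this concretely, here is a small sketch of what the original attempt does. The string s is an assumption, reconstructed from the outputs shown in this answer; the pattern is the one from the attempt, with the backslash before the asterisk doubled so that Python passes it through without an invalid-escape warning.

```python
import re

# Assumed input, reconstructed from the outputs shown in this answer.
s = (
    "person alpha:\nHow are you today?\n\n"
    "person beta:\nI'm fine, thank you.\n\n"
    "person alpha:\nWhat's up?\n\n"
    "person beta:\nNot much, just hanging around."
)

# Same regex as the original attempt. Only the newline alternative
# ever matches: the labels are followed by ':' (not a space), and
# there is no asterisk in the text.
pattern = 'person alpha |person beta |\\*|\n'

lines = re.split(pattern, s)
print(lines)

# The result is the same as a plain split on single newlines.
print(lines == s.split('\n'))
```

The empty strings in the result come from the blank lines between dialog items.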
Now, I would like to split the string on person alpha and person beta, so that the resulting list looks as follows
Generally, we say "split on X" to mean that X is the delimiter. Since person alpha and person beta are both things that should appear at the beginnings of results, they are not delimiters.
Instead, the delimiter we are looking for is the word boundary before those phrases.
When we look for that delimiter, we want to make sure that it is followed by the person identifier (so that we know that it's the delimiter), but the regex must not consume that identifier as part of the match. To address this, we use positive lookahead.
We want: a word boundary (\b), with a positive lookahead ((?=...)) for person, followed by one of the person names, followed by a colon. To simplify, I'll assume that the person name can be anything after the word person, and shouldn't be restricted to alpha and beta.
So the lookahead should match person.*:, meaning the entire lookahead clause is (?=person.*:). The entire regex is \b(?=person.*:), and we use a raw string for this, so that the backslash is understood literally by Python and passed literally to the regex engine (which will apply its own interpretation of the \b sequence, a word boundary, instead of Python's, a backspace character).
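The difference between the two layers of escaping can be checked directly. This is a minimal sketch; the test string 'a person' is made up for illustration.

```python
import re

plain = '\b'   # Python turns this into one character: a backspace
raw = r'\b'    # two characters: a backslash followed by the letter b

print(len(plain), len(raw))

# The regex engine interprets the two-character sequence as a word boundary.
print(re.findall(raw + 'person', 'a person'))

# Here the engine looks for a literal backspace followed by "person",
# which does not occur in the text.
print(re.findall(plain + 'person', 'a person'))
```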
Putting it together:
>>> re.split(r'\b(?=person.*:)', s)
['', 'person alpha:\nHow are you today?\n\n', "person beta:\nI'm fine, thank you.\n\n", "person alpha:\nWhat's up?\n\n", 'person beta:\nNot much, just hanging around.']
Notice that this left an empty string at the beginning of the output list. That's because the first delimiter that we're looking for is at the very beginning of the input. re.split gives us whatever's before, between and after the delimiters. Before the first delimiter, in our case, is an empty string.
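If the re.split approach is otherwise what you want, one possible workaround (not the approach this answer goes on to recommend) is simply to filter out empty strings afterwards. A minimal sketch, assuming Python 3.7 or later (where re.split can split on empty matches) and an input s reconstructed from the outputs shown in this answer:

```python
import re

# Assumed input, reconstructed from the outputs shown in this answer.
s = (
    "person alpha:\nHow are you today?\n\n"
    "person beta:\nI'm fine, thank you.\n\n"
    "person alpha:\nWhat's up?\n\n"
    "person beta:\nNot much, just hanging around."
)

# Split at the zero-width positions, then drop any empty pieces.
parts = [part for part in re.split(r'\b(?=person.*:)', s) if part]
print(parts)
```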
To avoid this, one simple approach is to recast the problem. Instead of searching for the points between the dialog items, we'll search for the dialog items themselves (it doesn't matter that there isn't any text in between them).
Each item looks like person, a name, :, whatever text, and two newlines; as a regex, person.*?:.*?\n\n. Because the regex will now actually match text rather than just looking ahead, it's important to use reluctant (non-greedy) quantifiers, i.e. the ?s after each .* in that regex. A greedy .* would grab as much text as possible and merge several items into one match.
Then, we use that regex with re.findall. It needs to use a raw string again, and we also need to use the re.DOTALL option for the regex, to tell the regex engine that . should be able to match a newline. (Otherwise, the regex would fail, because the second .*? won't match the single newlines within each dialog item before the double newline is reached.)
Putting it together:
>>> re.findall(r'person.*?:.*?\n\n', s, flags=re.DOTALL)
['person alpha:\nHow are you today?\n\n', "person beta:\nI'm fine, thank you.\n\n", "person alpha:\nWhat's up?\n\n"]
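The effect of re.DOTALL is easy to check by running the same pattern with and without it. As before, s is an assumption reconstructed from the outputs shown in this answer.

```python
import re

# Assumed input, reconstructed from the outputs shown in this answer.
s = (
    "person alpha:\nHow are you today?\n\n"
    "person beta:\nI'm fine, thank you.\n\n"
    "person alpha:\nWhat's up?\n\n"
    "person beta:\nNot much, just hanging around."
)

pattern = r'person.*?:.*?\n\n'

# Without DOTALL, '.' cannot cross the newline after the colon,
# so nothing matches at all.
without_dotall = re.findall(pattern, s)
print(without_dotall)

# With DOTALL, each item up to and including its blank line is matched.
with_dotall = re.findall(pattern, s, flags=re.DOTALL)
print(len(with_dotall))
```

Note that the final dialog item is not in the result: this input does not end with two newlines, so the pattern cannot match the last item.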
There are many other ways to write the regex, depending on how the requirements are interpreted. For example, rather than matching a word boundary (\b) with the re.split approach, we could look for a beginning-of-line anchor (^), which needs the re.MULTILINE flag so that it matches at the start of every line rather than only at the start of the string. In fact, we don't need anything besides the lookahead pattern, as long as we don't mind splitting the text anywhere that it says person someone: (even if it isn't at the beginning of a line (^), isn't at the beginning of a word (\b), or whatever else). With the re.findall approach, on the other hand, we could exclude the \n\n from the matches by checking for it with lookahead.
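These variations can be sketched as follows. The input s is again an assumption reconstructed from the outputs shown in this answer, and the \Z alternative in the last pattern is my own addition so that an item at the very end of the input (with no trailing blank line) is also found. The re.split calls assume Python 3.7 or later, where splitting on empty matches is supported.

```python
import re

# Assumed input, reconstructed from the outputs shown in this answer.
s = (
    "person alpha:\nHow are you today?\n\n"
    "person beta:\nI'm fine, thank you.\n\n"
    "person alpha:\nWhat's up?\n\n"
    "person beta:\nNot much, just hanging around."
)

# Three ways to find the split points; on this input they agree.
by_word = re.split(r'\b(?=person.*:)', s)
by_line = re.split(r'^(?=person.*:)', s, flags=re.MULTILINE)
by_lookahead = re.split(r'(?=person.*:)', s)
print(by_word == by_line == by_lookahead)

# findall with a lookahead for the terminator: the blank line is
# checked for but not included, and \Z also allows end of input.
items = re.findall(r'person.*?:.*?(?=\n\n|\Z)', s, flags=re.DOTALL)
print(items)
```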
But if the items are always separated by \n\n, and it isn't really necessary to verify that they start with a person label, we could just split the text on that literal sequence. That doesn't even require regex:
>>> s.split('\n\n')
['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']