I have a block of text containing messages. To recognize the start of each message I want to use the regular expression pattern "\[[0-9]{2}.{1}[0-9]{2}\]" (the .{1} is there because the separator is sometimes a dash and sometimes a colon; it isn't consistent because the text is produced through OCR).
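As a quick sanity check of the pattern, here is a minimal sketch. The two sample lines are made up for illustration, and the pattern is written as a raw string (r"...") so the backslashes reach the regex engine unchanged.

```python
import re

# Same pattern as above, as a raw string.
TIMESTAMP = r"\[[0-9]{2}.{1}[0-9]{2}\]"

# Hypothetical sample: one timestamp with a colon, one with a dash.
sample = "[04:29] first message\n[07-04] second message"

stamps = re.findall(TIMESTAMP, sample)
print(stamps)  # ['[04:29]', '[07-04]']
```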
Now I want to split this block of text and put each message in an array. I first tried re.split("\[[0-9]{2}.{1}[0-9]{2}\]", text) and it works, but the result doesn't include the separator matched by \[[0-9]{2}.{1}[0-9]{2}\].
For example, [00:10] Player: Hello would be put in the array as Player: Hello instead of [00:10] Player: Hello.
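A minimal reproduction of this first attempt, assuming a short made-up input held in a variable named text:

```python
import re

# Hypothetical two-message input, for illustration only.
text = "[00:10] Player: Hello\n[00:12] Player: Bye"

# Plain split: the matched timestamps are consumed and dropped.
parts = re.split(r"\[[0-9]{2}.{1}[0-9]{2}\]", text)
print(parts)  # ['', ' Player: Hello\n', ' Player: Bye']
```

The leading empty string appears because the text starts with a separator.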
So I tried re.findall("\[[0-9]{2}.{1}[0-9]{2}\].*", text) and now it does include the [00:10], but the issue is that if a message spans multiple lines, the additional lines aren't included.
[00:10] Player: Hello\n Humans! would be inserted in the array as [00:10] Player: Hello.
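The same behaviour can be reproduced with a small made-up input (the variable names here are illustrative). By default . does not match a newline, so each match stops at the end of its line:

```python
import re

# Hypothetical input: the first message continues on a second line.
text = "[00:10] Player: Hello\n Humans!\n[00:12] Player: Bye"

# Without re.DOTALL, ".*" stops at the first newline.
matches = re.findall(r"\[[0-9]{2}.{1}[0-9]{2}\].*", text)
print(matches)  # ['[00:10] Player: Hello', '[00:12] Player: Bye']
```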
I then tried adding re.DOTALL, because apparently it makes . match those \n as well, so I tried re.findall("\[[0-9]{2}.{1}[0-9]{2}\].*", text, re.DOTALL). Now the issue is that the messages aren't separated: the whole input block comes back as a single item in the result array.
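A minimal sketch of what happens here, again on a made-up input. With re.DOTALL the greedy .* runs past every newline to the end of the text, so the first match swallows all later timestamps:

```python
import re

# Hypothetical input with two messages, the first spanning two lines.
text = "[00:10] Player: Hello\n Humans!\n[00:12] Player: Bye"

# With re.DOTALL, "." also matches "\n", so the greedy ".*" consumes everything.
matches = re.findall(r"\[[0-9]{2}.{1}[0-9]{2}\].*", text, re.DOTALL)
print(len(matches))        # 1
print(matches[0] == text)  # True
```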
Here is the block of text for reference:
"""
[04:29] [All] Player1 (Hecarim); tariane test +
[04:41] [All] Player1 (Hecarim); ?ariane test
[05:07] [All] Player1 (Hecarim); ?ariane test
it haha
[07-04] Player1 (Hecarim)
dddddddddddddddddddddddddddddddddddddddddddddddddddddd
[07:08] Player1 (Hecarim)
ddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
ddddddddddddddddddddd
"""
I've read this: https://stackoverflow.com/a/2136580/20250022 about using .split() with capturing groups.
I tried re.split("(\[[0-9]{2}.{1}[0-9]{2}\])", text) but the separator and the message end up in separate array items: [00:10] Player1: Hello becomes ["[00:10]", "Player1: Hello"].
I know I could merge those items back together afterwards, but I want to find out how to do it cleanly.
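For reference, the merge workaround mentioned above might look like the following sketch. The input and variable names are made up for illustration, and it assumes the text begins with a timestamp or with content that can be discarded.

```python
import re

# Hypothetical input, for illustration only.
text = "[00:10] Player1: Hello\n Humans!\n[00:12] Player2: Bye"

# The capturing group makes re.split keep the separators in the result.
parts = re.split(r"(\[[0-9]{2}.{1}[0-9]{2}\])", text)
# parts[0] is whatever precedes the first timestamp (here an empty string);
# after that, items alternate: separator, body, separator, body, ...

# Glue each separator back onto the body that follows it.
messages = [sep + body for sep, body in zip(parts[1::2], parts[2::2])]
print(messages)
# ['[00:10] Player1: Hello\n Humans!\n', '[00:12] Player2: Bye']
```

This does work, but it needs the extra pairing step, which is the part I'd like to avoid.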