
I have a block of text containing messages. To recognize the start of each message, I want to use the regular expression pattern "\[[0-9]{2}.{1}[0-9]{2}\]" (the .{1} is there because the separator is sometimes a dash and sometimes a colon; it is unpredictable because the text is produced through OCR).

Now I want to split this block of text and put each message in an array. I first tried re.split("\[[0-9]{2}.{1}[0-9]{2}\]", text), and it works, but the resulting items don't include the separator \[[0-9]{2}.{1}[0-9]{2}\].

For example [00:10] Player: Hello would be put in the array as Player: Hello instead of [00:10] Player: Hello

So I tried re.findall("\[[0-9]{2}.{1}[0-9]{2}\].*", text), and now it does include the [00:10], but if the message spans multiple lines, the additional lines aren't included.

[00:10] Player: Hello\n Humans! would be inserted in the array as [00:10] Player: Hello

I tried adding re.DOTALL because it apparently makes . match \n as well, so I tried re.findall("\[[0-9]{2}.{1}[0-9]{2}\].*", text, re.DOTALL), and now the issue is that the messages aren't separated: the whole input block comes back as a single item in the result array.
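To illustrate the behaviour described above: with re.DOTALL, the greedy .* runs to the end of the text, so everything lands in a single match. The sketch below reproduces that, and also shows one possible way to stop each match early, using a lazy .*? followed by a lookahead. The sample string and the variable names are made up for illustration.

```python
import re

# Timestamp such as [00:10] or [07-04]; "." tolerates whatever separator OCR produced.
STAMP = r"\[[0-9]{2}.[0-9]{2}\]"

# Hypothetical two-message sample; the first message spans two lines.
text = "[00:10] Player: Hello\n Humans!\n[00:12] Other: Hi"

# Greedy ".*" with DOTALL swallows every character up to the end of the text.
greedy = re.findall(STAMP + r".*", text, re.DOTALL)

# Lazy ".*?" stops as soon as the lookahead sees the next timestamp
# or the end of the string (\Z).
lazy = re.findall(STAMP + r".*?(?=" + STAMP + r"|\Z)", text, re.DOTALL)

print(len(greedy))  # one item containing the whole block
print(lazy)         # one item per message, timestamp included
```

Each item in lazy keeps its trailing newline, so a .strip() may still be wanted afterwards.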

Here is the block of text for reference:

"""
 [04:29] [All] Player1 (Hecarim); tariane test +

[04:41] [All] Player1 (Hecarim); ?ariane test

[05:07] [All] Player1 (Hecarim); ?ariane test
it haha

[07-04] Player1 (Hecarim)
dddddddddddddddddddddddddddddddddddddddddddddddddddddd
[07:08] Player1 (Hecarim)
ddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
ddddddddddddddddddddd
 
"""

I've read this: https://stackoverflow.com/a/2136580/20250022 about using .split() with capturing groups.

I tried re.split("(\[[0-9]{2}.{1}[0-9]{2}\])", text), but then the separator and the message are not in the same array item: [00:10] Player1: Hello becomes ["[00:10]", "Player1: Hello"].

I know I could merge those items back together afterwards, but I want to find out how to do this cleanly.
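For reference, the merge-afterwards workaround mentioned above could look roughly like this. It is a sketch on a made-up two-message string, not a recommended solution; the names parts and merged are illustrative.

```python
import re

STAMP = r"\[[0-9]{2}.[0-9]{2}\]"

# Hypothetical sample with two messages.
text = "[00:10] Player1: Hello\n[00:12] Player2: Hi"

# With a capturing group, re.split returns
# [text before the first separator, sep1, msg1, sep2, msg2, ...].
parts = re.split("(" + STAMP + ")", text)

# Separators sit at odd indexes and their messages at the following even
# indexes, so zip the two slices and glue each pair back together.
merged = [stamp + body for stamp, body in zip(parts[1::2], parts[2::2])]

print(merged)
```

It works, but the index arithmetic is easy to get wrong, which is why a single-pass approach is preferable.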

lyeaf
  • This is kind of a lot of text to parse to understand what you're trying to do. It would help if you added your desired result for the given input. – Mark Feb 04 '23 at 04:54
  • @Mark I want the block of text in my message above to become exactly this (code indentation doesn't seem to work in messages so copy-paste it in your editor for better formatting) : [ "[04:29] [All] Player1 (Hecarim); tariane test +", "[04:41] [All] Player1 (Hecarim); ?ariane test", "[05:07] [All] Player1 (Hecarim); ?ariane test it haha", "[07-04] Player1 (Hecarim) dddddddddddddddddddddddddddddddddddddddddddddddddddddd", "[07:08] Player1 (Hecarim) ddddddddddddddddddddddddddddddddddddddddddddddddddddddddd ddddddddddddddddddddd" ] – lyeaf Feb 04 '23 at 04:57

1 Answer


I found a solution!

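I needed to use re.split(), and to keep the separator in each resulting item, I had to put the pattern inside a lookahead group, (?=...). A lookahead matches without consuming any characters, so the split happens just before each timestamp and the timestamp stays attached to the message that follows it.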

So instead of re.split("\[[0-9]{2}.{1}[0-9]{2}\]", text), I should have used re.split("(?=\[[0-9]{2}.{1}[0-9]{2}\])", text).
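Here is a minimal sketch of the lookahead split applied to a shortened version of the sample text from the question. The helper name split_messages and the whitespace clean-up step are my own additions, included only to get close to the desired output given in the comments. Note that splitting on an empty (zero-width) match requires Python 3.7 or later.

```python
import re

# Raw string avoids invalid-escape warnings; ".{1}" is equivalent to ".".
STAMP = r"\[[0-9]{2}.[0-9]{2}\]"

def split_messages(text):
    """Split OCR'd chat text into one string per message (hypothetical helper)."""
    # The zero-width lookahead splits just before each timestamp,
    # so the timestamp stays with its message.
    parts = re.split("(?=" + STAMP + ")", text)
    # Drop the blank chunk before the first timestamp and collapse
    # newlines and repeated spaces inside each message.
    return [" ".join(p.split()) for p in parts if p.strip()]

# Shortened copy of the sample from the question.
sample = """
 [04:29] [All] Player1 (Hecarim); tariane test +

[04:41] [All] Player1 (Hecarim); ?ariane test

[05:07] [All] Player1 (Hecarim); ?ariane test
it haha

[07-04] Player1 (Hecarim)
dddd
[07:08] Player1 (Hecarim)
ddddd
ddd
"""

messages = split_messages(sample)
for message in messages:
    print(message)
```

If the original line breaks inside a message matter, replace the " ".join(p.split()) step with p.strip().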

lyeaf