regex findall() to ignore ONE newline, but recognize more than one

Question

I have a text file that looks like this essentially:

Game #16406772158 starts.\n#Game No : 16406772158\n

....

wins $0.75 USD\n\n\n_

The file has many lines of text separated by a single \n, and each game ends with \n\n\n. I want to find every game block in my text file. When my code looks like this, it works (but only for the first instance):

gameRegex = re.compile(r"""Game #(.+\n)*""") 
game = gameRegex.search(totalContent)

When I switch to the findall method, outputting the "game" variable looks like this:

['Yl9Ui1OhAPyGV0JlCPLRrg wins $0.75 USD\n',
  'G72AzGPQLTOWfYoNST1K/g wins $10 USD\n',
 '4bSQFjpEWTIcsil7GJkkVA wins $39.99 USD from the main pot with three of a kind, Kings.\n',
 'U3xFxCVFfFBt50sL9VgLgQ wins $1.45 USD\n', ..., ]

I am very new to programming and have no idea what to do here. I want the result to be a list in which each item holds the text of one game, up to the \n\n\n:

game = ['Game #16406772158 starts.\n#Game No : 16406772158\n***** Hand 
History for Game 16406772158 *****\n$50 USD NL Texas Hold'em - Wednesday, 
July 01, 00:00:01 EDT 2009 ... Yl9Ui1OhAPyGV0JlCPLRrg wins $0.75 USD\n', 
'Game #16406772158 starts.\n#Game No : 16406772158\n***** Hand History for 
Game 16406772158 *****\n$50 USD NL Texas Hold'em - Wednesday, July 01, 
00:00:01 EDT 2009 ... Yl9Ui1OhAPyGV0JlCPLRrg wins $0.75 USD\n']

Use `\n{2,}` to match 2 or more newlines – user3483203, Jun 14 '18 at 03:25
Can you show your expected output? – Austin, Jun 14 '18 at 03:34
Did you try a mere `re.split('\n{3,}', s)`? – Wiktor Stribiżew, Jun 14 '18 at 08:16
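
The split suggested in the last comment could look roughly like this. It is only a sketch: the sample string is made up, `total_content` stands in for the file contents, and it assumes the file holds real newline characters rather than the two characters backslash and n.

```python
import re

# Made-up sample with real newlines; two games, each ending in "\n\n\n"
total_content = (
    "Game #1 starts.\n#Game No : 1\nA wins $0.75 USD\n\n\n"
    "Game #2 starts.\n#Game No : 2\nB wins $10 USD\n\n\n"
)

# Split on runs of three or more newlines, skip the empty trailing piece,
# and put back the single newline that ended each game's last line
games = [chunk + "\n" for chunk in re.split(r"\n{3,}", total_content) if chunk]

print(games)
```

Single newlines inside a game are left alone because the separator pattern needs at least three in a row.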

score 1 · Answer 1 · answered Jun 14 '18 at 04:19

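I think the pattern you are looking for goes like this. The doubled backslashes make it match the literal two-character sequence \n (a backslash followed by n), as in the sample string further down; if your file contains real newline characters, use a single backslash instead: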

(?:(?!\\n\\n\\n).)+\\n\\n\\n

To drop the two extra \n at the end of each list item, use this regex instead:

(?:(?!\\n\\n\\n).)+\\n(?=\\n\\n)

Sample Code:

import re

# Raw string: "\\n" matches a literal backslash followed by "n", not a newline character
regex = r"(?:(?!\\n\\n\\n).)+\\n(?=\\n\\n)"
test_str = ("Game #16406772158 starts.\\n#Game No : 16406772158\\n\n"
    "Yl9Ui1OhAPyGV0JlCPLRrg wins $0.75 USD\\nG72AzGPQLTOWfYoNST1K/g wins $10 USD\\n'4bSQFjpEWTIcsil7GJkkVA wins $39.99 USD from the main pot with three of a kind, Kings.\\n'U3xFxCVFfFBt50sL9VgLgQ wins $1.45 USD\\nwins $0.75 USD\\n\\n\\nGame #16406772158 starts.\\n#Game No : 16406772158\\n....\n"
    "wins $0.75 USD\\n\\n\\n\n"
    "Game #16406772158 starts.\\n#Game No : 16406772158\\n\n"
    "....\n"
    "wins $0.75 USD\\n\\n\\n")
# Collect every match; re.DOTALL lets "." also match the real newlines in test_str
result = [match.group() for match in re.finditer(regex, test_str, re.DOTALL)]
print(result)

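Output (note that every item after the first still starts with the two \\n that the lookahead left unconsumed):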

["Game #16406772158 starts.\\n#Game No : 16406772158\\n\nYl9Ui1OhAPyGV0JlCPLRrg wins $0.75 USD\\nG72AzGPQLTOWfYoNST1K/g wins $10 USD\\n'4bSQFjpEWTIcsil7GJkkVA wins $39.99 USD from the main pot with three of a kind, Kings.\\n'U3xFxCVFfFBt50sL9VgLgQ wins $1.45 USD\\nwins $0.75 USD\\n", '\\n\\nGame #16406772158 starts.\\n#Game No : 16406772158\\n....\nwins $0.75 USD\\n', '\\n\\n\nGame #16406772158 starts.\\n#Game No : 16406772158\\n\n....\nwins $0.75 USD\\n']
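
As for why findall in the question returned only the last line of each game: when a pattern contains a capturing group, re.findall returns the group's text rather than the whole match, and a repeated group keeps only its last repetition. Making the group non-capturing with (?:...) is one way around that. A minimal sketch, assuming the file contains real newline characters; the sample text is invented for illustration:

```python
import re

# Invented sample with real newlines; two games separated by "\n\n\n"
total_content = (
    "Game #1 starts.\n#Game No : 1\nA wins $0.75 USD\n\n\n"
    "Game #2 starts.\n#Game No : 2\nB wins $10 USD\n\n\n"
)

# Capturing group: findall returns only the last repetition of the group
last_lines = re.findall(r"Game #(.+\n)*", total_content)

# Non-capturing group: findall returns each whole match
games = re.findall(r"Game #(?:.+\n)*", total_content)

print(last_lines)
print(games)
```

Because .+ needs at least one character, each match stops at the first empty line, so single newlines are passed over and the run of newlines ends the item.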
