re.DOTALL isn't selecting newline character

Question

This is the string I am working with, assigned to the variable string:

'\n\n <div class="screen-reader-text" ng-if="::(ctrl.messageViewModel.isChat || ctrl.messageViewModel.isReply)" role="heading" aria-level="5">\n\n\nADMIN_COMMAND STOP\n\n, reply from YATIN </div>\n\n \n <div class="media-left" ng-class="{ \'hide-media-left\' : ctrl.messageViewModel.editFormVisible }">\n \n \n \n \n </div>\n\n <div class="ts-message-thread-body align-item-left" data-tid="messageThreadBody" ng-class="{\'has-attachments\': ctrl.messageViewModel.hasAttachments} ">\n \n \n \n <div id="messageBody" class="message-body message-body-width" ng-if="!ctrl.messageViewModel.editFormVisible" simple-mouseenter="!ctrl.isInteropChat && ctrl.messageReactionsEnabled && ctrl.showMessageActions($event, this)" ng-mouseleave="!ctrl.isInteropChat && ctrl.messageReactionsEnabled && ctrl.hoverOutMessageBodyHandler($event)">\n <div class="message-body-top-row padded-content" ng-if="!ctrl.isHiddenByDlp" ng-class="{ \'unread-message\': ctrl.messageViewModel.isNewMessage,\n \'has-reactions\': ctrl.messageReactionsEnabled && ctrl.messageViewModel.messageHasReaction}">\n <div class="top-row-text-container" ng-class="{\'single-line-truncation\': ctrl.messageReactionsEnabled && ctrl.messageViewModel.isRightRail}">\n \n <div class="ts-msg-name app-small-font" data-tid="threadBodyDisplayName" dir="auto">\n and so on...

The main part of interest is: >\n\n\nADMIN_COMMAND STOP\n\n, reply from (in the ), from which I want to get ADMIN_COMMAND STOP

The ADMIN_COMMAND STOP part can be of any length and can have numbers. Also, there can be several \ns before and after it.

Other inputs can have:

>\n\n\nADMIN_COMMAND REFRESH\n\n, reply from

>0, reply from

>\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from

The output I want to get:

ADMIN_COMMAND STOP

ADMIN_COMMAND REFRESH

0

I came up with this:

re.findall(">.*([A-Z 0-9]*).*, reply from",string,re.DOTALL)

My logic:

Match one >, then zero or more of any character (including \n), then capture zero or more capital letters, spaces or digits, and then again match zero or more of any character (including \n).

Maybe like this: [`^>\s*(\w+(?: \w+)?)\s*, reply from`](https://regex101.com/r/p7QsU5/2/) ? — Jan, Jun 09 '20 at 13:30
@Jan I tried copying `^>\s+([A-Z]\w+ \w+)\s+` into my code but it doesn't give any output... Maybe instead of actually putting a new line the string should have `\n`... — Sabito stands with Ukraine, Jun 09 '20 at 13:33
Do you have the `multiline` mode on? Do your lines start with `>` or is there any whitespace before? — Jan, Jun 09 '20 at 13:35
@Jan I haven't modified any modes... they are what they are by default... — Sabito stands with Ukraine, Jun 09 '20 at 13:49

score 1 · Answer 1 · answered Jun 09 '20 at 13:56

It does find a match, because it doesn't return an empty list:

>>> import re
>>> string = ">\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from"
>>> re.findall(">.*([A-Z 0-9]*).*, reply from",string,re.DOTALL)
['']

The problem is that the capturing group ([A-Z 0-9]*) matches zero characters, because all characters have already been consumed by the greedy .* before it.
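To see where the characters actually go, one option is to wrap every piece of the pattern in its own group and inspect the result. This is only a sketch on the shortened string from above; the extra groups and the labels are added purely for illustration.

```python
import re

string = ">\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from"

# Same pattern as in the question, but every piece sits in its own group
# so we can see which piece consumed which characters.
parts = re.search(
    r"(>)(.*)([A-Z 0-9]*)(.*)(, reply from)", string, re.DOTALL
).groups()

labels = ["'>'", "first .*", "capture group", "second .*", "literal"]
for label, text in zip(labels, parts):
    print(f"{label:14} -> {text!r}")

# The first .* holds '\n\n\n\nADMIN_COMMAND STOP\n\n\n';
# the capture group and the second .* are both empty.
```

The first .* initially runs to the end of the string and then backtracks only as far as is needed for ", reply from" to match, so nothing is left over for the group.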

You can fix it by using the negated character class [^A-Z 0-9] before the capture group, so that the prefix stops at the first character the group can match. With that change alone the group captures only ADMIN, because the _ in ADMIN_COMMAND is not in the character class. After adding _ to both classes, it works as expected:

>>> re.findall(">[^A-Z 0-9_]*([A-Z 0-9_]*).*, reply from",string,re.DOTALL)
['ADMIN_COMMAND STOP']
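A possible alternative, sketched here under the assumption that only whitespace sits between > and the command and between the command and ", reply from", is to match that whitespace explicitly with \s* instead of relying on re.DOTALL. The pattern below is an illustration checked only against the three sample inputs from the question, not against the full HTML.

```python
import re

# Assumption: nothing but whitespace separates '>' and ', reply from'
# from the command itself.
pattern = re.compile(r">\s*([A-Z0-9_ ]+?)\s*, reply from")

samples = [
    ">\n\n\nADMIN_COMMAND REFRESH\n\n, reply from",
    ">0, reply from",
    ">\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from",
]

results = [pattern.findall(s) for s in samples]
print(results)
# [['ADMIN_COMMAND REFRESH'], ['0'], ['ADMIN_COMMAND STOP']]
```

The group is lazy so that trailing whitespace is left to the \s* after it rather than being pulled into the capture.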

Note that non-greedy matching, .*?, does not seem to have the desired effect in this case. Even if we put .*? both before and after the capture group, all characters end up being matched by the final .*? despite the greedy * in the middle:

>>> re.findall(">.*?([A-Z 0-9_]*).*?, reply from",string,re.DOTALL)
['']

I don't quite understand why.
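A likely explanation: a lazy .*? starts by matching nothing, and a group quantified with * is also allowed to match nothing. Right after the > the group therefore succeeds with an empty capture, and the final .*? simply stretches until ", reply from" matches. The engine accepts that first successful match and never tries a longer prefix. A minimal sketch, assuming the command always has at least one character, swaps * for + in the group so that an empty capture is no longer possible:

```python
import re

string = ">\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from"

# '*' lets the group match the empty string immediately after '>'.
with_star = re.findall(r">.*?([A-Z 0-9_]*).*?, reply from", string, re.DOTALL)

# '+' demands at least one character, so the first '.*?' has to grow
# until the group can start at the 'A' of ADMIN_COMMAND.
with_plus = re.findall(r">.*?([A-Z 0-9_]+).*?, reply from", string, re.DOTALL)

print(with_star)  # ['']
print(with_plus)  # ['ADMIN_COMMAND STOP']
```

Because the character class contains a space, a space directly after the > would be captured first, so this variant is only safe when the text after > begins with newlines or with the command itself.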

I had first accepted your answer because I checked it on `https://regex101.com/`, but when I tried it in my code I got this: `[' ']`. I think I will edit my question once more and add more of the HTML that I am giving as input. Please look at my question again once I have edited it. — Sabito stands with Ukraine, Jun 09 '20 at 14:07
HTML wasn't mentioned before... do you know about [parsing HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454)? — Thomas, Jun 09 '20 at 14:08
