This is the string I am working with: string =

'\n\n <!----><div class="screen-reader-text" ng-if="::(ctrl.messageViewModel.isChat || ctrl.messageViewModel.isReply)" role="heading" aria-level="5">\n\n\nADMIN_COMMAND STOP\n\n, reply from YATIN </div><!---->\n\n <!--Chat head-->\n <div class="media-left" ng-class="{ \'hide-media-left\' : ctrl.messageViewModel.editFormVisible }">\n <!-- Person icon -->\n \n <!---->\n \n </div>\n\n <div class="ts-message-thread-body align-item-left" data-tid="messageThreadBody" ng-class="{\'has-attachments\': ctrl.messageViewModel.hasAttachments} ">\n <!--EditMessage-->\n <!---->\n <!--EditMessage-->\n <!----><div id="messageBody" class="message-body message-body-width" ng-if="!ctrl.messageViewModel.editFormVisible" simple-mouseenter="!ctrl.isInteropChat &amp;&amp; ctrl.messageReactionsEnabled &amp;&amp; ctrl.showMessageActions($event, this)" ng-mouseleave="!ctrl.isInteropChat &amp;&amp; ctrl.messageReactionsEnabled &amp;&amp; ctrl.hoverOutMessageBodyHandler($event)">\n <!----><div class="message-body-top-row padded-content" ng-if="!ctrl.isHiddenByDlp" ng-class="{ \'unread-message\': ctrl.messageViewModel.isNewMessage,\n \'has-reactions\': ctrl.messageReactionsEnabled &amp;&amp; ctrl.messageViewModel.messageHasReaction}">\n <div class="top-row-text-container" ng-class="{\'single-line-truncation\': ctrl.messageReactionsEnabled &amp;&amp; ctrl.messageViewModel.isRightRail}">\n <!--Name-->\n <div class="ts-msg-name app-small-font" data-tid="threadBodyDisplayName" dir="auto">\n and so on...

The main part of interest is: >\n\n\nADMIN_COMMAND STOP\n\n, reply from (in the ), from which I want to extract ADMIN_COMMAND STOP.

The ADMIN_COMMAND STOP part can be of any length and can have numbers. Also, there can be several \ns before and after it.

Other inputs can have:

>\n\n\nADMIN_COMMAND REFRESH\n\n, reply from

>0, reply from

>\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from

The output I want to get:

ADMIN_COMMAND STOP

ADMIN_COMMAND REFRESH

0

I came up with this:

re.findall(">.*([A-Z 0-9]*).*, reply from",string,re.DOTALL)

My logic:

Match one >, then zero or more of any character (including \n), then zero or more capital letters, spaces, or digits, and then again zero or more of any character (including \n).

1 Answer

Your pattern does find a match: the result is not an empty list, but a list containing one empty string:

>>> import re
>>> string = ">\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from"
>>> re.findall(">.*([A-Z 0-9]*).*, reply from",string,re.DOTALL)
['']

The problem is that the capturing group ([A-Z 0-9]*) matches zero characters. The greedy .* before it has already consumed everything up to ", reply from", and because the group allows zero repetitions, an empty match is enough for it to succeed.
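To see where the characters go, one option is to wrap every part of the pattern in its own group and inspect the result. This is a minimal sketch on the shortened test string from above; the extra parentheses are added purely for illustration and are not part of the original pattern.

```python
import re

string = ">\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from"

# Same pattern as in the question, but with both .* parts captured as well,
# so that m.groups() shows what each piece ended up matching.
m = re.search(">(.*)([A-Z 0-9]*)(.*), reply from", string, re.DOTALL)
groups = m.groups()

# groups[0]: text taken by the first greedy .*
# groups[1]: text taken by the [A-Z 0-9]* group of interest
# groups[2]: text taken by the second .*
print(groups)
```

The first group holds the newlines and the command, while the other two are empty strings.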

You can fix it by using the negated character class [^A-Z 0-9] before the capture group. With that change alone, the group captures only ADMIN, because the _ in ADMIN_COMMAND is not in the character class. After adding _ to both classes, it works as expected:

>>> re.findall(">[^A-Z 0-9_]*([A-Z 0-9_]*).*, reply from",string,re.DOTALL)
['ADMIN_COMMAND STOP']
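On real markup, the first > in the input usually belongs to some unrelated tag, so the pattern above can latch onto the wrong place. A stricter variant is sketched below; it only accepts whitespace between the > and the command. Both the `html` fragment (a trimmed-down, hypothetical piece of the markup from the question) and the `pattern` itself are assumptions for illustration, and regular expressions remain a brittle way to process HTML in general.

```python
import re

# Hypothetical, shortened fragment of the markup shown in the question.
html = (
    '\n\n <!----><div class="screen-reader-text" role="heading" aria-level="5">'
    '\n\n\nADMIN_COMMAND STOP\n\n, reply from YATIN </div><!---->\n\n <!--Chat head-->\n'
)

# The pattern from the answer starts matching at the very first ">" and
# captures the space after "<div".
loose = re.findall(">[^A-Z 0-9_]*([A-Z 0-9_]*).*, reply from", html, re.DOTALL)

# Assumed stricter pattern: ">" followed only by whitespace, then a command
# made of capitals, digits, underscores and inner spaces, then optional
# whitespace and the literal ", reply from".
pattern = r">\s*([A-Z0-9_][A-Z0-9_ ]*?)\s*, reply from"
strict = re.findall(pattern, html)

# The three sample inputs listed in the question.
samples = [
    ">\n\n\nADMIN_COMMAND REFRESH\n\n, reply from",
    ">0, reply from",
    ">\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from",
]
results = [re.findall(pattern, s) for s in samples]

print(loose)
print(strict)
print(results)
```

The stricter pattern needs no re.DOTALL because it contains no dot; \s already matches newlines.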

Note that non-greedy matching, .*?, does not seem to have the desired effect in this case. Even if we put .*? both before and after the capture group, all characters end up being matched by the final .*? despite the greedy * in the middle:

>>> re.findall(">.*?([A-Z 0-9_]*).*?, reply from",string,re.DOTALL)
['']

I don't quite understand why.
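A likely explanation, offered here as an addition rather than as part of the original answer: a backtracking engine returns the first way of matching that succeeds, and it does not try to make the group as long as possible overall. The lazy .*? starts out matching nothing, the group is then tried directly after the > and succeeds with zero characters (the next character is \n), and the final .*? grows until ", reply from" fits. The sketch below captures every part to show this, and then requires at least one character in the group with + instead of *.

```python
import re

string = ">\n\n\n\nADMIN_COMMAND STOP\n\n\n, reply from"

# Capture all three parts of the lazy pattern to see what each one matched.
m = re.search(">(.*?)([A-Z 0-9_]*)(.*?), reply from", string, re.DOTALL)
lazy_groups = m.groups()

# With + the group can no longer succeed on an empty match, so the first
# .*? has to advance until the group finds at least one character.
with_plus = re.findall(">.*?([A-Z 0-9_]+).*?, reply from", string, re.DOTALL)

print(lazy_groups)
print(with_plus)
```

Note that the + version still stops at the first character from the class, so on input where a space or capital letter appears earlier (as in full HTML) it would capture that instead.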

Thomas
  • I had first accepted your answer because I checked it on `https://regex101.com/`, but when I tried it in my code I got this: `[' ']`... I think I will edit my question once more and add more of the HTML that I am giving as input... Please look at my question again once I have edited it... – Sabito stands with Ukraine Jun 09 '20 at 14:07
  • HTML wasn't mentioned before... do you know about [parsing HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454)? – Thomas Jun 09 '20 at 14:08
  • oh I didn't know that... – Sabito stands with Ukraine Jun 09 '20 at 14:11