
I have the following HTML text:

> <div class=WordSection1><p class=MsoNormal dir=RTL><span lang=HE style='font-family:"Arial",sans-serif;color:#1F497D'>Hi</span><span dir=LTR style='color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal dir=RTL><span dir=LTR style='color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><div style='border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm'><p class=MsoNormal dir=RTL style='line-height:normal'><b><span dir=LTR>From</span></b><span dir=RTL></span><b><span lang=HE><span dir=RTL></span>:</span></b><span lang=HE> </span><span dir=LTR>Some Guy</span><span dir=RTL></span><span lang=HE><span dir=RTL></span> <br></span><b><span dir=LTR>Sent</span></b><span dir=RTL></span><b><span lang=HE><span dir=RTL></span>:</span></b><span lang=HE> </span><span dir=LTR>Tuesday, October 16, 2018 5:02 PM</span><span lang=HE><br></span><b><span dir=LTR>To</span></b><span dir=RTL></span><b><span lang=HE><span dir=RTL></span>:</span></b><span lang=HE> </span><span dir=LTR>Other Guy</span><span dir=RTL></span><span lang=HE><span dir=RTL></span>‏ &lt;</span><span dir=LTR>otherguy@domain.com</span>

I am trying to use a RegEx pattern to locate this part:

<span dir=LTR>From</span>


The RegEx pattern I am using is:

<span(.*?)>From</span>

The issue I am facing and wish to resolve is that the above pattern matches a larger portion of the text than the part I am trying to mark.

My question is: how can I use regular expressions to locate the shortest possible match?

See the pic of the actual match (marked) and the desired match (double marked).

Corion
Egor

1 Answer


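A regular expression engine always reports the leftmost match: it starts at the earliest position in the text where the pattern can succeed. A lazy quantifier such as .*? only makes the match as short as possible from that starting position; it does not move the start to the right. Your pattern therefore begins at the first <span in the text and expands until it reaches >From</span>. You can write a pattern that can only succeed at a later position, but you cannot force a non-leftmost match.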

In your case, you can make the pattern more specific, for example by disallowing any > between <span and the end of the opening tag:

<span[^>]*>From</span>

This will break if you have attributes that contain an (unescaped) >.
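To see the difference between the two patterns, here is a minimal sketch in Python (chosen only for illustration; the question does not name a language). The `sample` string is a shortened, hypothetical stand-in for the HTML in the question, and the variable names are mine:

```python
import re

# Shortened stand-in for the HTML in the question (illustrative only).
sample = (
    '<p class=MsoNormal dir=RTL><span lang=HE>Hi</span>'
    '<b><span dir=LTR>From</span></b>'
)

# Lazy version: the match still starts at the FIRST "<span" in the text,
# and ".*?" grows until ">From</span>" can follow.
lazy = re.search(r'<span(.*?)>From</span>', sample)

# Negated character class: "[^>]*" cannot cross the end of the opening tag,
# so the attempt at the first "<span" fails and the engine moves on.
strict = re.search(r'<span[^>]*>From</span>', sample)

print(lazy.group(0))
# <span lang=HE>Hi</span><b><span dir=LTR>From</span>
print(strict.group(0))
# <span dir=LTR>From</span>
```

The lazy quantifier still gives the shortest match from the first position where a match is possible, which is why it swallows the earlier span.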

Also, you shouldn't use regular expressions for parsing HTML. See "RegEx match open tags except XHTML self-contained tags".
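As one possible illustration of the parser route, the sketch below uses Python's standard-library `html.parser` to find every span whose own text is exactly "From". The class name `SpanFinder` and the `sample` string are hypothetical, and this is only one way to do it, not something the linked post prescribes:

```python
from html.parser import HTMLParser


class SpanFinder(HTMLParser):
    """Collect the attributes of every <span> whose direct text equals wanted_text."""

    def __init__(self, wanted_text):
        super().__init__()
        self.wanted_text = wanted_text
        self.open_spans = []  # stack of currently open <span> elements
        self.matches = []     # attribute dicts of the spans that matched

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.open_spans.append({"attrs": dict(attrs), "text": ""})

    def handle_data(self, data):
        # Text is credited to the innermost open span only.
        if self.open_spans:
            self.open_spans[-1]["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            span = self.open_spans.pop()
            if span["text"] == self.wanted_text:
                self.matches.append(span["attrs"])


# Shortened stand-in for the HTML in the question (illustrative only).
sample = (
    '<p class=MsoNormal dir=RTL><span lang=HE>Hi</span>'
    '<b><span dir=LTR>From</span></b>'
    '<span lang=HE><span dir=RTL></span>:</span></p>'
)

finder = SpanFinder("From")
finder.feed(sample)
finder.close()
print(finder.matches)
# [{'dir': 'LTR'}]
```

Because the parser tracks tag boundaries itself, an attribute value containing > does not confuse it the way it confuses the [^>]* pattern.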

Corion