
I have the following regex:

<div class="dotted-highlight">\s*[^<strong>]\s*(.*?)\s*[^</strong>]<ul>

and the following string:

<div class="dotted-highlight">So, to sum up your own appearance:
<ul>
<li>Your hair must be neat and tidy.</li>

And I am trying to replace it with

<div class="dotted-highlight"><strong>$1</strong><ul>

The problem is that $1 returns the string with its first letter omitted.

CURRENT OUTPUT:

---------------------------------------
                                      |
                                      V
<div class="dotted-highlight"><strong>o, to sum up your own appearance:</strong><ul>

EXPECTED OUTPUT

<div class="dotted-highlight"><strong>So, to sum up your own appearance:</strong><ul>

LIVE EXAMPLE

http://regexr.com/3c72m

Adrian

2 Answers


[^<strong>] is a negated character class: it matches any single character that is not one of the listed characters (in this case, any character except <, s, t, r, o, n, g, or >). Here it matches the S, which therefore falls outside the capture group and is not part of the text inserted by $1.
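To make this concrete, here is a minimal sketch in Python. The choice of Python's `re` module is an assumption on my part (the question does not say which language runs the pattern), but a negated character class behaves the same way there. It shows the class consuming the "S" before the capture group starts.

```python
import re

# Subject string from the question.
subject = (
    '<div class="dotted-highlight">So, to sum up your own appearance:\n'
    '<ul>\n'
    '<li>Your hair must be neat and tidy.</li>'
)

# [^<strong>] is a negated set: exactly one character that is
# none of <, s, t, r, o, n, g, >.
negated_set = re.compile(r'[^<strong>]')
first = negated_set.match('So, to sum up')
print(first.group())  # the single character consumed by the class

# The pattern from the question: the class eats one character
# before the capture group (.*?) begins.
original = re.compile(
    r'<div class="dotted-highlight">\s*[^<strong>]\s*(.*?)\s*[^</strong>]<ul>'
)
captured = original.search(subject).group(1)
print(captured)
```

The second class, [^</strong>], also consumes exactly one character in the same way (here, the newline before <ul>).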

What I believe you're looking for is (?!<strong>). This is a negative lookahead, which asserts that the current position is not followed by the literal <strong>.

Regex:

~<div class="dotted-highlight">\s*+(?!<strong>)(.*?)\s*<ul>~si
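As a rough sketch of how this replacement could be applied, here is a Python version. Python is an assumption; the ~...~si form above is PCRE-style. Because possessive quantifiers are only available in Python 3.11+, this variant moves the optional whitespace into the lookahead instead of using \s*+. That is a substitution on my part, not the exact pattern above. The function name `wrap_in_strong` is hypothetical.

```python
import re

# Variant of the pattern: the lookahead absorbs the optional whitespace
# itself, so no possessive quantifier is needed.
PATTERN = re.compile(
    r'<div class="dotted-highlight">(?!\s*<strong>)\s*(.*?)\s*<ul>',
    re.DOTALL | re.IGNORECASE,  # counterparts of the s and i flags
)


def wrap_in_strong(html: str) -> str:
    """Wrap the text between the div and the <ul> in <strong>, unless it already starts with <strong>."""
    return PATTERN.sub(
        r'<div class="dotted-highlight"><strong>\1</strong><ul>', html
    )


subject = (
    '<div class="dotted-highlight">So, to sum up your own appearance:\n'
    '<ul>\n'
    '<li>Your hair must be neat and tidy.</li>'
)
result = wrap_in_strong(subject)
print(result)

# Text that already starts with <strong> is left alone.
already_wrapped = '<div class="dotted-highlight"> <strong>Done</strong>\n<ul>'
untouched = wrap_in_strong(already_wrapped)
```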

Or, if you only want to exclude cases where <strong> strictly covers the whole text, use this lookahead instead:
(?!<strong>[^<]*</strong>\s*<ul>)

Regex:

<div class="dotted-highlight">\s*+(?!<strong>[^<]*</strong>\s*<ul>)([^\s<]*+(?:(?!\s*<ul>)[\s<]+[^\s<]*)*+)\s*<ul>
  • [^\s<]*+(?:(?!\s*<ul>)[\s<]+[^\s<]*)*+ is a more efficient (unrolled) way to match
    everything up to \s*<ul>
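A small sketch of how this stricter variant behaves, again in Python as an assumption. The possessive quantifiers are dropped and the leading whitespace is folded into the lookahead so that it runs on older Python versions, so this is an approximation of the pattern above rather than a copy. The sample strings other than the one from the question are made up for illustration.

```python
import re

STRICT = re.compile(
    r'<div class="dotted-highlight">'
    r'(?!\s*<strong>[^<]*</strong>\s*<ul>)'           # skip only fully wrapped text
    r'\s*([^\s<]*(?:(?!\s*<ul>)[\s<]+[^\s<]*)*)\s*<ul>'  # unrolled "anything up to \s*<ul>"
)

plain = '<div class="dotted-highlight">So, to sum up your own appearance:\n<ul>'
partly = '<div class="dotted-highlight"><strong>Note:</strong> read this\n<ul>'
fully = '<div class="dotted-highlight"><strong>Note</strong>\n<ul>'

plain_text = STRICT.search(plain).group(1)
partly_text = STRICT.search(partly).group(1)
fully_match = STRICT.search(fully)

print(plain_text)
print(partly_text)
print(fully_match)
```

Only the fully wrapped case is rejected; text that merely contains a <strong> element is still captured.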

regex101 demo


As a side note, I focused on answering what was wrong with your pattern. This will work for the subject string you provided, but regex is not the right tool for parsing HTML. You may be interested in reading How do you parse and process HTML/XML in PHP?

Mariano
  • `\s*+` is `\s*` with a [possessive quantifier](http://www.regular-expressions.info/possessive.html). You should keep it, otherwise it may fail if there's a space before `<strong>` – Mariano Nov 14 '15 at 16:26
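The backtracking problem behind this comment can be sketched as follows, in Python as an assumption. Since possessive quantifiers need Python 3.11+, the "guarded" pattern below folds the whitespace into the lookahead instead, which is a stand-in for \s*+ and not the same construct. The sample string is made up for illustration.

```python
import re

already_wrapped = '<div class="dotted-highlight"> <strong>Title</strong>\n<ul>'

# Plain \s* can backtrack: it gives the space back, the lookahead then looks
# at " <strong>" (which does not begin with "<strong>") and succeeds.
backtracking = re.compile(
    r'<div class="dotted-highlight">\s*(?!<strong>)(.*?)\s*<ul>', re.DOTALL
)

# With the whitespace inside the lookahead there is nothing to give back.
guarded = re.compile(
    r'<div class="dotted-highlight">(?!\s*<strong>)\s*(.*?)\s*<ul>', re.DOTALL
)

unwanted = backtracking.search(already_wrapped)
print(repr(unwanted.group(1)))  # the already-wrapped text is matched anyway
print(guarded.search(already_wrapped))
```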

It has already been mentioned that you misunderstood the character class. Parsing HTML with regex is not recommended, but if the input is not arbitrary and the string always looks like this, I think you could simply search for

<div class="dotted-highlight">\K[^<]+(?=<ul>)

and replace with

<strong>$0</strong>
  • \K resets the beginning of the reported match
  • [^<]+ matches one or more characters that are not <
  • (?=<ul>) is a lookahead that requires <ul> to follow
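For engines without \K, the same idea can be sketched with a lookbehind. The example below uses Python as an assumption; its `re` module does not support \K, so a fixed-width lookbehind is swapped in, and \g<0> plays the role of $0. The function name `emphasise` is hypothetical.

```python
import re

# The lookbehind stands in for \K: the div must precede the match
# but is not part of it.
PATTERN = re.compile(r'(?<=<div class="dotted-highlight">)[^<]+(?=<ul>)')


def emphasise(html: str) -> str:
    """Wrap the text between the div and the following <ul> in <strong>."""
    # \g<0> refers to the whole match.
    return PATTERN.sub(r'<strong>\g<0></strong>', html)


subject = (
    '<div class="dotted-highlight">So, to sum up your own appearance:\n'
    '<ul>\n'
    '<li>Your hair must be neat and tidy.</li>'
)
result = emphasise(subject)
print(result)
```

Note that [^<]+ also takes the trailing newline, so it ends up inside the <strong> element. That is harmless in HTML, but it differs from the patterns that trim whitespace with \s*.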

See demo at regex101

bobble bubble