
I find that preg_match_all and preg_replace do not find the same matches based on the same pattern.

My pattern is:

/<(title|h1|h2|h3|h4|h5|ul|ol|p|figure|caption|span)(.*?)><\/(\1)>/

When I run this against a snippet containing the likes of

<span class="blue"></span> 

with preg_match_all I get 17 matches.

When I use the same pattern in preg_replace I get 0 matches. Replacing the \1 with the selection list does find the matches, but of course that won't work as a solution, because it no longer ensures that the closing tag is the same type as the opening tag.

The overall goal is to find instances of tags with no content that should not be present without content...a holy crusade, I assure you.

In testing whether the regex works, I have also tried it in php cli. Here is the output:

Interactive shell

php > $str = 'abc<span class="blue"></span>def';
php > $pattern = "/<(title|h1|h2|h3|h4|h5|ul|ol|p|figure|caption|span)(.*?)><\/(\1)>/";
php > $final = preg_replace($pattern, '', $str);
php > print $final;
abc<span class="blue"></span>def
JAyenGreen
  • (.*?) always seems to cause problems. Change that to: ([^>]+) meaning at least 1 non greater than. If that works, let me know and I'll write up a more complete answer. – sniperd Sep 12 '17 at 18:20
  • If I understood correctly, I changed the pattern to: "/<($search)([^>]+)><\/(\1)>/i" which resulted in no preg_match_all matches. I noted that it required something in the tag other than the tag name, which isn't necessarily the case (it could be just <span>), so I changed it to "/<($search)([^>]*)><\/(\1)>/i" but still no matches. – JAyenGreen Sep 12 '17 at 18:39
  • ZA̡͊͠͝LGΌ, H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Will Barnwell Sep 12 '17 at 18:42
  • Very entertaining, but the same would be an issue if I were to replace the angle brackets with some random delimiter thus making it NOT HTML , so not very helpful :-) – JAyenGreen Sep 12 '17 at 18:46
  • Incorrect my friend, the reason you got 17 matches is because you wrote a regex that relied on html structuring and matched it in a way you did not intend. You are trying to search DOM using regex, and this is unholy and wrong. Checkout my answer for a mature explanation. – Will Barnwell Sep 12 '17 at 18:50
  • Well, it is a string, not DOM, but are you saying that the preg_match_all was incorrect? I ask because there were actually 17 span tags with no content in the document. – JAyenGreen Sep 12 '17 at 18:54
  • So are you looking for empty tags? Since you didn't mention that I assumed you were pursuing unholy ends – Will Barnwell Sep 12 '17 at 18:56
  • Ah, sorry, yes. Looking for instances of tags that shouldn't be present if they have no content. Description duly edited. I could use php string functions, or DOM methods, but both would be really tedious. – JAyenGreen Sep 12 '17 at 18:57
  • Not really that tedious, and far safer than regex, i found [a question](https://stackoverflow.com/questions/1896081/php-lib-for-parsing-html-to-dom-hierarchy-tree) which recommends some tools for working with DOM through PHP – Will Barnwell Sep 12 '17 at 19:00
  • Thanks for that. Well, I suppose I can just do a find on each of the tags and remove the ones without text nodes. I will say, however, that my question being -1 reflects the ennui that continues to allow regex to fail in this regard. Someone who didn't recognize the context as being HTML would have no idea why it doesn't, because it should, even if it is not the Best tool. If you'll post the link to your great diatribe as the answer, I'll select it :-) – JAyenGreen Sep 12 '17 at 19:06
  • Can you imagine, i wasn't even the downvote. Regex is the wrong tool to parse HTML/XML/any tree structured markup language, not just "not the best" – Will Barnwell Sep 12 '17 at 19:13
  • Also testing [here](http://infoheap.com/php-preg_replace-online/) Your regex works with replace, so I suspect there is some problem outside of the information provided in your question – Will Barnwell Sep 12 '17 at 19:14
  • Not that I can see. Please see the php cli test added to the description. – JAyenGreen Sep 12 '17 at 20:35

1 Answer

$str = 'abc<span class="blue"></span>def';
$pattern = "/<(title|h1|h2|h3|h4|h5|ul|ol|p|figure|caption|span)(.*?)><\/(\\1)>/";
                                                              // added \  ^
$final = preg_replace($pattern, '', $str);
print $final;
// prints 'abcdef'

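Explanation:

"\1" // <-- double-quoted: PHP reads \1 as an octal escape, i.e. the single byte chr(1)

is very different from

'\1' // <-- single-quoted: a literal backslash followed by 1, which PCRE reads as a backreference

In a double-quoted string, PHP processes the escape sequence before the pattern ever reaches the regex engine, so the backslash has to be doubled ("\\1") to survive. This is also the reason I almost exclusively use single-quoted strings. See http://php.net/string#language.types.string.syntax.double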
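
To see the difference directly, here is a minimal sketch that builds the same pattern both ways and runs each through preg_replace. The variable names ($tags, $doubleQuoted, $singleQuoted) are my own and purely illustrative; only $str and the pattern itself come from the question.

```php
<?php
// Illustrative names only; $str and the tag list are taken from the question.
$str  = 'abc<span class="blue"></span>def';
$tags = 'title|h1|h2|h3|h4|h5|ul|ol|p|figure|caption|span';

// Double-quoted: PHP converts \1 into the single byte 0x01 before PCRE sees it,
// so the closing tag name would have to be that control character.
$doubleQuoted = "/<($tags)(.*?)><\/(\1)>/";

// Single-quoted: \1 stays a backslash followed by 1, i.e. a backreference
// to whatever the first group matched.
$singleQuoted = '/<(' . $tags . ')(.*?)><\/(\1)>/';

// Inspect what each string really contains.
var_dump(strpos($doubleQuoted, "\x01") !== false); // the control byte is present
var_dump(strpos($singleQuoted, '\1') !== false);   // backslash + "1" is present

$withDouble = preg_replace($doubleQuoted, '', $str);
$withSingle = preg_replace($singleQuoted, '', $str);

echo $withDouble, "\n"; // input comes back unchanged
echo $withSingle, "\n"; // the empty span is removed
```

Building the pattern in single quotes and concatenating the tag list keeps every backslash exactly as typed, at the cost of losing variable interpolation inside the string.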

Jakumi