
I have a regex that works, but I want it to reject matches that contain a specific string.

/\<meta[^\>]+(http\-equiv[^\>]+?refresh[^\>]+?(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?|(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?http\-equiv[^\>]+?refresh[^\>]+?)\/?\>/is

This matches both of the following (http-equiv and url may appear in either order):

  1. <meta http-equiv="refresh" content="21;URL='http://example.com/'" />
  2. <meta content="21;URL='http://example.com/'" http-equiv="refresh" />

I want to exclude any tag whose URL contains `?PageSpeed=noscript`, for example:

  a. <meta content="21;URL='http://example.com/?PageSpeed=noscript'" http-equiv="refresh" />
  b. <meta content="21;URL='http://example.com/segment?PageSpeed=noscript&var=value'" http-equiv="refresh" />

Any ideas are much appreciated. Thanks.

Wiktor Stribiżew
Shivanand Sharma
  • Are those standard meta tags, or just a similar format with a semicolon, as in your example? – Jerson Dec 26 '21 at 14:04
  • Standard meta tag that redirects the page. Essentially the regex detects if the page is redirecting to somewhere. So value for "content" has to be non-negative. And finally the URL must not include `?PageSpeed=noscript` – Shivanand Sharma Dec 26 '21 at 14:07
  • I would just use `str_contains` for this specific case because it is easier to spot and comment the exception. Not the answer you are looking for, probably, I understand. – Chris Haas Dec 26 '21 at 14:18
  • [You could use a negative *lookahead* (demo)](https://regex101.com/r/zSWTQ1/1). Probably a better idea to use a parser as mentioned. – bobble bubble Dec 26 '21 at 15:16
  • @bobblebubble Love it! How to make this an accepted answer? – Shivanand Sharma Dec 27 '21 at 06:06
  • @ShivanandSharma Was a bit busy and just read your comment late :) Glad it helped! – bobble bubble Dec 28 '21 at 16:21
  • @bobblebubble There's a technical requirement for the "content" value to be 0 or positive. Is it possible to ensure that it doesn't match `` ? Many thanks. – Shivanand Sharma Dec 29 '21 at 04:59
  • @ShivanandSharma Sure it's possible. See updated answer. – bobble bubble Dec 29 '21 at 14:15
  • @ShivanandSharma I put an answer as you asked for, no idea why not accepting so I removed it. [This was the last pattern](https://regex101.com/r/L8Nf2u/1) which was adjusted to your and mickymacs comment. – bobble bubble Jan 03 '22 at 18:05
  • @bobblebubble Pls restore your answer. I was kind of occupied. I'll mark as accepted. Thanks. – Shivanand Sharma Jul 05 '22 at 11:10
  • @ShivanandSharma I had restored it 3 minutes after you asked... nothing happened. I'm done here :) Also I don't think parsing html with regex is a good practice. It's solved, so no worries. – bobble bubble Jul 06 '22 at 11:00
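
The substring-check idea from the comments above (keep the regex unchanged and handle the exception in a separate, commented line) could look roughly like the sketch below. The function name `isRedirectWithoutNoscript` is hypothetical, the pattern is the question's regex with its redundant backslash escapes dropped, and `stripos` stands in for `str_contains` so the check is case-insensitive and also runs on PHP 7. It assumes the input holds a single tag; `preg_match` only inspects the first match.

```php
<?php
// Hypothetical helper: the question's regex as-is, plus a separate substring test.
function isRedirectWithoutNoscript(string $html): bool
{
    // Same logic as the original pattern (http-equiv and url in either order).
    $pattern = '/<meta[^>]+(http-equiv[^>]+?refresh[^>]+?(?<!-)(?<!\d)[0-9]\d*[^>]+?url[^>]+?'
             . '|(?<!-)(?<!\d)[0-9]\d*[^>]+?url[^>]+?http-equiv[^>]+?refresh[^>]+?)\/?>/is';

    if (preg_match($pattern, $html, $m) !== 1) {
        return false; // not a meta refresh tag at all
    }

    // The exception: ignore redirects whose tag contains ?PageSpeed=noscript.
    return stripos($m[0], '?PageSpeed=noscript') === false;
}
```

This only helps if the scanner can run a second check after the regex; if it accepts nothing but a single pattern, the exclusion has to live inside the regex itself.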

1 Answer


You could use a DOM parser (PHP's DOMDocument) instead of a regex.

<?php

$meta = '<meta content="21;URL=\'http://example.com/\'" http-equiv="refresh" /> <meta content="21;URL=\'http://example.com/?PageSpeed=noscript\'" http-equiv="refresh" />';

$dom = new DOMDocument;
$dom->loadHTML($meta);
$noPageScripts = [];

foreach ($dom->getElementsByTagName('meta') as $tag) {
  $content = $tag->getAttribute('content');
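  // Capture the redirect target from content="<seconds>;URL=<target>"
  preg_match('/URL=["\']?([^"\'>]+)["\']?/i', $content, $matches);

  // Keep only refresh tags whose URL does not contain ?PageSpeed=noscript
  if (strcasecmp($tag->getAttribute('http-equiv'), 'refresh') === 0 && isset($matches[1]) && stripos($matches[1], '?PageSpeed=noscript') === false) {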
    $noPageScripts[] = [
      'originalTag' => $dom->saveHTML($tag),
      'url' => $matches[1]
    ];
  }
}

var_dump($noPageScripts);


Jerson
  • I can use a dom parser however this is a malware scanner and it works by regex-matching in the source-code of the page. Technical limitation of my software. But I'm glad your reply is helpful for those looking for the DOM parser. – Shivanand Sharma Dec 27 '21 at 06:05
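
For a scanner that can only apply a single regex, the negative lookahead mentioned in the question's comments might be sketched as follows. This is not necessarily the pattern from the linked demos: it is the question's regex with its redundant backslash escapes dropped and one lookahead, `(?![^>]*\?PageSpeed=noscript)`, added right after `<meta`. Because `[^>]*` cannot cross the closing `>`, the lookahead only inspects the current tag. The function name `isRefreshRedirect` is hypothetical.

```php
<?php
// Hypothetical helper: the question's regex plus one negative lookahead.
function isRefreshRedirect(string $html): bool
{
    $pattern = '/<meta(?![^>]*\?PageSpeed=noscript)' // reject tags containing the marker
             . '[^>]+(http-equiv[^>]+?refresh[^>]+?(?<!-)(?<!\d)[0-9]\d*[^>]+?url[^>]+?'
             . '|(?<!-)(?<!\d)[0-9]\d*[^>]+?url[^>]+?http-equiv[^>]+?refresh[^>]+?)\/?>/is';

    return preg_match($pattern, $html) === 1;
}

// The four sample tags from the question.
$samples = [
    '<meta http-equiv="refresh" content="21;URL=\'http://example.com/\'" />',
    '<meta content="21;URL=\'http://example.com/\'" http-equiv="refresh" />',
    '<meta content="21;URL=\'http://example.com/?PageSpeed=noscript\'" http-equiv="refresh" />',
    '<meta content="21;URL=\'http://example.com/segment?PageSpeed=noscript&var=value\'" http-equiv="refresh" />',
];

foreach ($samples as $tag) {
    echo (isRefreshRedirect($tag) ? 'match    ' : 'no match ') . $tag . PHP_EOL;
}
```

With these inputs the first two tags should be reported as matches and the two `?PageSpeed=noscript` tags as non-matches. The `i` flag makes the excluded string case-insensitive as well; whether that is desirable depends on the scanner's requirements.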