I have this code:

$tags = implode("|", array("a", "script", "link", "iframe", "img", "object"));
$attrs = implode("|", array("href", "src", "data"));
$any_tag = "\w+(?:\s*=\s*[\"'][^\"']*[\"'])?";
$replace = array(
    "/(<(?:$tags)(?:\s*$any_tag)*\s*(?:$attrs)=[\"'])(?![\"']?(?:data:|#))([^'\"]+)([\"'][^>]*>)/" => function($match) {
        return $match[1] . $match[2] . $match[3]; // return same data
    }
);
$page = preg_replace_callback_array($replace, $page);
echo $page;

and I'm running this code against https://duckduckgo.com/d2038.js, and $page is empty after the replace runs. Why? When I add print_r($match); in the callback, I get:

Array
(
    [0] => <a href='/a'>
    [1] => <a href='
    [2] => /a
    [3] => '>
)

The same happens if I assign the result of the replace to another variable. Why is the page empty?

If I run this in regex101 it matches more elements (https://regex101.com/r/CPGuKd/1) and it doesn't clear the output.

jcubic
  • You shouldn't parse HTML with regular expressions. Use the appropriate extensions instead (DOM, SimpleXML, XMLReader, etc.) – Ruslan Osmanov Dec 30 '16 at 10:11
  • [HTML can't be safely parsed](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Peon Dec 30 '16 at 10:12
  • @RuslanOsmanov how am I supposed to parse HTML inside JavaScript? – jcubic Dec 30 '16 at 10:14
  • For convenience, actual regex executed: /(<(?:a|script|link|iframe|img|object)(?:\s*\w+(?:\s*=\s*[\"'][^\"']*[\"'])?)*\s*(?:href|src|data)=[\"'])(?![\"']?(?:data:|#))([^'\"]+)([\"'][^>]*>)/ – Mr47 Dec 30 '16 at 10:16
  • @DainisAbols WTF, that is the weirdest SO answer I have ever seen... :) – Cagy79 Dec 30 '16 at 10:18
  • @revo $self is a URL with a __proxy_url= param, and proxy_url encodes the URL in base64 and prepends base64:. The same happens if I just put `return $match[1] . $match[2] . $match[3];` in the callback. – jcubic Dec 30 '16 at 10:26
  • Does `$page` hold nothing or `NULL` (`var_dump($page)`)? – revo Dec 30 '16 at 10:27
  • @revo the result is NULL. – jcubic Dec 30 '16 at 10:29
  • @revo I've got PREG_BACKTRACK_LIMIT_ERROR. Any clue why I get this error? – jcubic Dec 30 '16 at 10:33
  • @RuslanOsmanov: unfortunately you can't use XMLReader for HTML (to be precise, you can only use it with a perfectly XML-compliant XHTML document). – Casimir et Hippolyte Dec 30 '16 at 10:48
  • Hey @CasimiretHippolyte, STBU, but I'd like to bring your attention to [this question](http://stackoverflow.com/questions/41320032/regular-expression-with-counting). – revo Dec 30 '16 at 11:01
  • @revo: interesting, I posted an answer. – Casimir et Hippolyte Dec 30 '16 at 12:09

1 Answer

The final cooked regex from within your code is this:

(<(?:a|script|link|iframe|img|object)(?:\s*\w+(?:\s*=\s*["'][^"']*["'])?)*\s*(?:href|src|data)=["'])(?!["']?(?:data:|#))([^'"]+)(["'][^>]*>)

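which differs from the one in your live demo and causes catastrophic backtracking. Once PCRE hits its backtrack limit, the preg function returns NULL instead of a string, which is why $page comes out empty.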
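One way to make a failure like this visible is to check the return value and ask `preg_last_error()` what went wrong. The following is only a minimal sketch: `replace_or_fail` is a hypothetical helper name and the sample `$page` string is made up for illustration; the pattern is the one from the question.

```php
<?php
// Hypothetical helper (not part of the original code): run the replacement
// and raise an exception instead of silently returning NULL on a PCRE error.
function replace_or_fail(array $patterns, string $subject): string
{
    $result = preg_replace_callback_array($patterns, $subject);
    if ($result === null) {
        // Map the numeric code from preg_last_error() to its constant name.
        $names = array(
            PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
            PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
            PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
            PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
            PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
            PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
        );
        $code = preg_last_error();
        $name = isset($names[$code]) ? $names[$code] : "unknown error $code";
        throw new RuntimeException("preg_replace_callback_array failed: $name");
    }
    return $result;
}

// Pattern pieces copied from the question.
$tags = implode("|", array("a", "script", "link", "iframe", "img", "object"));
$attrs = implode("|", array("href", "src", "data"));
$any_tag = "\w+(?:\s*=\s*[\"'][^\"']*[\"'])?";
$replace = array(
    "/(<(?:$tags)(?:\s*$any_tag)*\s*(?:$attrs)=[\"'])(?![\"']?(?:data:|#))([^'\"]+)([\"'][^>]*>)/" => function ($match) {
        return $match[1] . $match[2] . $match[3]; // return same data
    }
);

// Small made-up input; a large script file is where the limit gets hit.
$page = "<p><a href='/a'>link</a></p>";
echo replace_or_fail($replace, $page), "\n";
```

On a short input like this the call should simply return the string. On an input that exhausts the backtrack limit, the exception message names the error constant, so the empty output is no longer a mystery.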

To match your live demo, the PHP code needs one small change:

"/(<(?:$tags)(?:\s*$$any_tag)*...
                   ^
revo
  • Because, while iterating over the input string, the regex engine fails sooner when it has to match a `$` (end of input string / line). Yes, that doesn't make sense without analyzing the regex itself, so I'd go with a possessive quantifier `\s++` instead. Now it makes sense. – revo Dec 30 '16 at 10:49
  • I've tested again and it doesn't work; it doesn't match `` – jcubic Dec 30 '16 at 10:51
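The possessive-quantifier idea from the comments can be sketched as follows. This is an assumed variant, not the exact fix proposed in the thread: `$any_attr` is a hypothetical replacement for `$any_tag` that requires whitespace before each attribute (`\s+`) and makes the name possessive (`\w++`), so a long run of word characters can no longer be split up in many different ways. The sample strings are made up for illustration.

```php
<?php
$tags  = implode("|", array("a", "script", "link", "iframe", "img", "object"));
$attrs = implode("|", array("href", "src", "data"));

// Assumed variant of $any_tag: mandatory whitespace before the attribute
// name, and a possessive \w++ that never gives characters back.
$any_attr = "\s+\w++(?:\s*=\s*[\"'][^\"']*[\"'])?";
$pattern = "/(<(?:$tags)(?:$any_attr)*\s*(?:$attrs)=[\"'])(?![\"']?(?:data:|#))([^'\"]+)([\"'][^>]*>)/";

$urls = array();
$replace = array(
    $pattern => function ($match) use (&$urls) {
        $urls[] = $match[2];                       // collect the matched URL
        return $match[1] . $match[2] . $match[3];  // return same data
    }
);

// Ordinary markup: both URLs should be seen by the callback.
$html = "<a class=\"x\" href=\"/a\">link</a><script src='x.js'></script>";
$out = preg_replace_callback_array($replace, $html);

// "<a" followed by a long run of word characters, the kind of input that
// invites heavy backtracking with the unanchored \s*\w+ repetition.
$long = "<a" . str_repeat("a", 5000) . ">";
$out_long = preg_replace_callback_array($replace, $long);

var_dump($out === $html, $urls, $out_long === $long);
```

Requiring `\s+` slightly changes what counts as an attribute separator, so it is worth re-checking the pattern against the real input before relying on it.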