
I have a regex that seems to stop working entirely when it is given the wrong input.

My code:

function dbStr($string)
{
    // Embeds have the same syntax as scripts, so the same regexes work for both.
    // These are plain local variables: "private static" is only valid on class members.
    $tag = "(script|embed)";
    $tvnc = "(\\\\'|\\\\\"|[^<>\"'/])*?"; // Tag Valid No Close
    $quoteseq = "['\"](\\\\'|\\\\\"|[^\"'])*?['\"]";
    $tvncq = "(" . $tvnc . $quoteseq . $tvnc . ")*?"; // Tag Valid No Close Quotes

    $string = preg_replace_callback
    (
        "#<" . $tvnc . $tag . "(" . $tvncq . "(src=" . $quoteseq . ")" . $tvncq . ")/>#imsSX", // Pattern
        "dbStr_FilterSinglematch", // Callback
        $string // Subject
    );
    return $string;
}

function dbStr_FilterSinglematch($m)
{
    print_r($m);
    return "";
}

Now, let's say I call it with this input:

echo "\n" . dbStr
("







<script type='textjavascript' src='asdf'/>

    <script type='textjavascript' src='asdf'>
    asdfasfasdf


    uyoiyoiuyoiuy

");

It works fine! It finds a match, and removes that match. Here is the output that I am sent from that call:

Array
(
    [0] => <script type='textjavascript' src='asdf'/>
    [1] => 
    [2] => script
    [3] =>  type='textjavascript' src='asdf'
    [4] =>  type='textjavascript' 
    [5] => =
    [6] => t
    [7] =>  
    [8] => src='asdf'
    [9] => f
)











    <script type='textjavascript' src='asdf'>
    asdfasfasdf


    uyoiyoiuyoiuy

However, if I give it THIS input instead....

echo "test" . dbStr
(
'

<embed type="application/x-shockwave-flash" src="http://picasaweb.google.com/s/c/bin/slideshow.swf" width="288" height="192" flashvars="host=picasaweb.google.com&amp;hl=en_US&amp;feat=flashalbum&amp;RGB=0x000000&amp;feed=http%3A%2F%2Fpicasaweb.google.com%2Fdata%2Ffeed%2Fapi%2Fuser%2F109941697484668010012%2Falbumid%2F5561383933745906193%3Falt%3Drss%26kind%3Dphoto%26authkey%3DGv1sRgCN2H88H41qeT6AE%26hl%3Den_US" pluginspage="http://www.macromedia.com/go/getflashplayer"></embed>

'.

"



<script type='textjavascript' src='asdf'/>
<script  fubar=\"d\\\\\'erp\"  derp=\"dlerp\">
    //<script type='text/javascript' src='asdf'/>
    asdfasfasdf
</script>
<script>
    uyoiyoiuyoiuy
</script>
");

Nothing. Nothing at all. No matches are found, but the text I get out of the regex is completely blank!

I mean, seriously... What the heck? This is the output I get from running the above code:

test 

Yes, that's it.

If the regex had found any matches (say, matched the entire document) then wouldn't it have output something from my print_r() call? No, I don't think it's even calling the callback. The regex is failing altogether.

What's worse is, I have the following headers/ini settings set:

header('Content-type: text/plain');
error_reporting(E_ALL);
ini_set("display_errors", 1);

But I am not seeing ANY errors either in my log OR in the output itself!
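A possible reason nothing shows up: the `preg_*` functions return `NULL` when PCRE gives up at run time, and the reason is only available through `preg_last_error()`, so `display_errors` has nothing to print. The sketch below reproduces that kind of silent failure. The helper `pcreErrorName()` and the lowered `pcre.backtrack_limit` are assumptions made for the demo, not part of the code above.

```php
<?php
// Hypothetical helper (not from the original code): maps a PCRE error code to its constant name.
function pcreErrorName($code)
{
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );
    return isset($names[$code]) ? $names[$code] : 'unknown PCRE error ' . $code;
}

// Assumption for the demo only: a tiny backtrack limit so the failure is easy to trigger.
ini_set('pcre.backtrack_limit', '100');

// Nested quantifiers, similar in spirit to the (...)*? groups nested inside (...)*? above.
$result = preg_replace('/(a+)+$/', '', str_repeat('a', 30) . 'b');

if ($result === null) {
    // No warning is raised; the error code is the only trace of the failure.
    echo 'preg_replace failed: ', pcreErrorName(preg_last_error()), "\n";
}
```

Adding the same `=== null` check after the `preg_replace_callback` call in `dbStr()` should show whether this is what happens with the long input.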

So, there you have it, my regex predicament. Does anyone have any ideas as to why this is failing?

EDIT:

I have narrowed down the source of my problems:

echo "test " . dbStr
('<embed tests="abc" tests="abc" flashvars="AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"></embed>');

It seems when I have two attributes about that long, and then a very long attribute, the system crashes. However, THIS input does not crash...: (there are more A's but no preceding tags)

echo "test " . dbStr
('<embed
flashvarsembed>');

That being stated, with the added A's the preceding tags now only need to be THIS long to crash it:

echo "test " . dbStr
('<embed a="b" c="d"
flashvarsembed>');

It seems that this is a memory-related issue... Is there a fix? The code this will be parsing could be extremely long.
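One likely cause is that lazy groups nested inside other lazy groups let PCRE split the same attribute text in a great many ways, so a long value can exhaust `pcre.backtrack_limit` or `pcre.recursion_limit`. Raising those settings with `ini_set()` is a stopgap. A more robust option is a pattern that cannot backtrack into text it has already consumed. The following is a minimal sketch using possessive quantifiers. It assumes attribute values are plain quoted strings without backslash escapes, which is a simplification of the pattern above. The function name `stripSelfClosingTags()` is hypothetical, and the sketch removes every self-closing script or embed tag without inspecting `src`.

```php
<?php
// Hypothetical function, not from the original code.
// Assumption: quoted attribute values contain no backslash-escaped quotes.
function stripSelfClosingTags($html)
{
    // Each alternative is possessive (++ / *+), so once a run of attribute text
    // or a quoted value has been consumed, PCRE never retries it in smaller pieces.
    $pattern = '#<(?:script|embed)\b(?:[^<>"\'/]++|"[^"]*+"|\'[^\']*+\')*+/>#i';

    $result = preg_replace($pattern, '', $html);
    if ($result === null) {
        // Surface the failure instead of silently returning an empty string.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $result;
}

// A very long attribute value in front of a self-closing script tag.
$long  = str_repeat('A', 100000);
$input = '<embed a="b" c="d" flashvars="' . $long . '"></embed>'
       . "<script type='textjavascript' src='asdf'/>";

// Prints the length of what is left after the self-closing script tag is removed.
echo strlen(stripSelfClosingTags($input)), "\n";
```

To decide per tag, the same pattern could be passed to `preg_replace_callback()` with a capturing group around the attribute section, and the callback could look at the `src` value before choosing what to return.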

Georges Oates Larsen
  • Why are you trying to parse HTML with regex in the first place? ([The pony, he comes...](http://stackoverflow.com/a/1732454/1338999))...HTML is a non-regular language and thus is not able to be parsed by regular expressions. Instead make use of the DOM. – Matt Sep 04 '12 at 19:31
  • I'd like to remove untrusted script tags. I don't need full parsing of the HTML – Georges Oates Larsen Sep 04 '12 at 19:32
  • [Does this article help?](http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page) I didn't read the whole thing, but it seems like it might be what you're looking for... – Matt Sep 04 '12 at 19:33
  • Also, this question may help you: http://stackoverflow.com/questions/7072327/strip-tags-function-blacklist-rather-than-whitelist – Matt Sep 04 '12 at 19:35
  • Your dbStr function isn't returning anything. I assume you want to `return $string` – tigrang Sep 04 '12 at 19:36
  • @Matt: That webpage told me to use a regex -- which is what I am doing, mine is just more complicated (to avoid malicious users trying to confuse the regex). EDIT: I think I may have misread the article, let me check again – Georges Oates Larsen Sep 04 '12 at 19:36
  • @tigrang oops, sorry, what i pasted is a somewhat simplified version of my actual code. Forgot to add a return statement – Georges Oates Larsen Sep 04 '12 at 19:37
  • @Matt: Alright, I had a closer look! The problem is I need to check the src of script tags to see if they are from a trusted website. I fear using a DomDocument object because, A: This may not cover edge-case syntax on different browsers (that hackers could use to confuse the parser) and B: this is a full-fledged parser... Do I really need to load THAT MUCH just to remove some script tags? I will consider rewriting the algorithm, but for now, I would like to know why my regex is failing – Georges Oates Larsen Sep 04 '12 at 19:44
  • I'm not an expert on regex, but I can assure you that trying to parse HTML using regular expressions will just cause headaches like this one. HTML is non-regular. It's the nature of the beast. – Matt Sep 04 '12 at 19:49
  • @Matt I just did some reading on why not to use regex on HTML. I understand now. This is a very rare case -- script tags (as well as embed tags, right?) cannot contain other tags. Everything between them is treated as code, or not read at all. Therefore I *can* get away with regexes for scripts, but ONLY scripts, and nothing more advanced. Please correct me if I am wrong, I am willing to learn. But I do think scripts are a rare occasion when I can break this mandate. I will probably end up using DomDocument, but that doesn't kill my curiosity as to why the regex is just... crashing like this. – Georges Oates Larsen Sep 04 '12 at 19:50
  • I have a feeling it fails because of this line: `//` I think the regex sees a "script within a script". – Matt Sep 04 '12 at 19:56
  • @Matt The regex I have pasted here is one designed to find single-tag scripts. It will have already failed seeing the previous double-tag script-enter, as that lacks a "/" before its closing ">" However, I will still test it out to see if you are on to something – Georges Oates Larsen Sep 04 '12 at 20:00
  • @Matt removing that part of the code did not cause the error to go away... Let me play around here for a bit – Georges Oates Larsen Sep 04 '12 at 20:03
  • Ugh. It appears to be a length related issue... Edited the above post – Georges Oates Larsen Sep 04 '12 at 20:23
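
The DOM route raised in the comments could look roughly like the sketch below. The function name `removeUntrustedTags()`, the `$trustedHosts` whitelist, and the example host names are all hypothetical. Tags with no `src` at all are treated as untrusted here, which is an assumption. libxml's HTML parser does not necessarily read malformed markup the way every browser does, so this is an illustration of the approach and not a complete sanitizer.

```php
<?php
// Hypothetical function, not from the original post.
// $trustedHosts is an assumed whitelist of host names allowed in src attributes.
function removeUntrustedTags($html, array $trustedHosts)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML('<div>' . $html . '</div>');   // wrapper gives the fragment one known parent
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (array('script', 'embed') as $tagName) {
        // Copy the live node list first; removing nodes while iterating it skips entries.
        $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
        foreach ($nodes as $node) {
            $host = parse_url($node->getAttribute('src'), PHP_URL_HOST);
            if (!in_array($host, $trustedHosts, true)) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize only the children of the wrapper <div>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<p>hi</p>'
      . '<script src="http://evil.example/x.js"></script>'
      . '<script src="http://trusted.example/y.js"></script>';

echo removeUntrustedTags($html, array('trusted.example')), "\n";
```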
