
-------------test.hta file code ------------

<!DOCTYPE html>
<html>
<head>
<title>dead</title>
</head>
<body>
txt<textarea id="content" >
            <input name="" type="text" class="qu_te1n05ew" value="请输入您的E-mail地址" />
           <input name="" type="submit" class="qu_sbt02" value="提 交" />
           </textarea>
<button onclick="startCls();">start</button>

<script>
function getObj(id) {
    return 'string' == typeof id ? document.getElementById(id) : id;
}

function startCls() {
    var txt = getObj('content').value;
    var srcRe = /<\w+(?:\s[^<>]*(?:(?:'[^']*')|(?:"[^"]*"))?[^<>]*)*\s+src\s*\=\s*["']?(?:[^"' <>]*\/)?([^\/"'<>]+\.(?:gif|jpg|png))['" ](?:\s[^<>]*(?:(?:'[^']*')|(?:"[^"]*"))?[^<>]*)*\/?>/ig;
    alert(srcRe.exec(txt));
}
</script>
</body>
</html>

------------code end-------

Why does srcRe.exec(txt) appear to loop forever and hang the HTA? With other test strings it works.

The regex is meant to find the src attribute of a tag such as img and capture only the file name. It should not match a src that is not inside a complete tag, as in <b><img src="ss.gif" </b>, because that img has no closing > and so is not a valid HTML tag.

The subpattern (?:\s[^<>]*(?:(?:'[^']*')|(?:"[^"]*"))?[^<>]*)* is meant to say: any < or > inside the tag must be within '' or "", all other characters must not be < or >, and the tag as a whole starts with < and ends with >.

qidizi
  • Now that's a comment that needs a double vote! – gideon Feb 05 '12 at 15:55
  • [`TEXTAREA`](http://www.w3.org/TR/html4/interact/forms.html#edef-TEXTAREA) allows only parsed character data, not any other markup. – Gumbo Feb 05 '12 at 15:56
  • Sorry, but anyone who uses a regex like the one you've got deserves the consequences. That is the least understandable line of code I've seen in weeks. I'd strongly suggest you do it another way (write actual JS to parse it) that's a lot more readable and maintainable, and won't have the issue you're having. – jfriend00 Feb 05 '12 at 16:19

1 Answer


I'm not going to debug this ghastly regex. But I can tell you why it fails. Breaking it down for "readability":

<
\w+
(?:\s[^<>]*(?:(?:'[^']*')|(?:"[^"]*"))?[^<>]*)*
\s+src\s*\=\s*["']?
(?:[^"' <>]*\/)?
([^\/"'<>]+\.(?:gif|jpg|png))
['" ]
(?:\s[^<>]*(?:(?:'[^']*')|(?:"[^"]*"))?[^<>]*)*
\/?
>

You can see that this can only match if there is a .gif, .jpg or .png in your string. There isn't, so the regex has to fail.

The problem now is that the regex engine takes a long time to figure this out because there are several instances of [^<>]* in your regex, all of which can (and will try to) match the entire tag's contents, and (to add insult to injury) all of which are even enclosed in repeating groups. See line 3 of the breakdown above, broken down further:

(?:
 \s
 [^<>]*       # optional!
 (?:
  (?:'[^']*')
  |
  (?:"[^"]*")
 )?           # optional!
 [^<>]*       # optional!
)*            # optional!

There are gazillions of permutations, all of which the regex engine has to check before it can declare failure. In short, it's not an infinite loop, but a regex like this with input like that will likely keep your computer busy until hell freezes over.

Hint 1: Read this tutorial on catastrophic backtracking.
Hint 2: Don't use regexes to parse HTML. At least not if you don't know exactly what you're doing.
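
If a regex is still wanted, one way to sidestep the ambiguity described above is to make every alternative inside the repeated group start with a different character, so that there is only one way to consume any given stretch of a tag. The following is a minimal sketch, not a real HTML parser. `getImageFileNames`, `tagRe`, `attrRe` and `imageRe` are names invented for this illustration, and it assumes that `<` and `>` only ever appear inside quoted attribute values.

```javascript
// Sketch: collect image file names from src attributes of complete tags.
// All names here are hypothetical helpers, not part of the original code.
function getImageFileNames(html) {
    // A complete tag: "<name", then any mix of plain characters and whole
    // quoted strings, then ">". The three alternatives start with different
    // characters, so the engine has only one way to read each piece.
    var tagRe = /<(\w+)\b((?:[^<>"']|"[^"]*"|'[^']*')*)>/g;
    // One attribute: a name, optionally followed by = and a value that is
    // double-quoted, single-quoted or unquoted.
    var attrRe = /([^\s=<>"'\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'<>]+)))?/g;
    // The last path segment, if it ends in a known image extension.
    var imageRe = /([^\/]+\.(?:gif|jpg|png))$/i;

    var names = [];
    var tag, attr, value, file;
    while ((tag = tagRe.exec(html)) !== null) {
        attrRe.lastIndex = 0; // start each tag's attribute scan from the beginning
        while ((attr = attrRe.exec(tag[2])) !== null) {
            if (attr[1].toLowerCase() !== 'src') {
                continue;
            }
            // Exactly one of the three value groups can be filled.
            value = attr[2] || attr[3] || attr[4] || '';
            file = imageRe.exec(value);
            if (file) {
                names.push(file[1]);
            }
        }
    }
    return names;
}
```

Called with the textarea content from the question, this returns an empty array instead of hanging, and the incomplete `<img src="ss.gif" </b>` is skipped because it never reaches a closing `>`.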

Tim Pietzcker
  • I mean writing a regular expression to get the `s.gif` from a `` tag, which may sometimes look like `` too, but not to match `` inside the tag. Now I know this is difficult to do with a `RegExp`; in my project I can write a simpler regexp that does what I want, and it runs better. My English is poor, so of the page on `catastrophic backtracking` (continued in the next comment) – qidizi Feb 06 '12 at 07:21
  • I could only understand a little more than half. If you want to get a string like `` from an HTML string, then apart from using `for` and `if` to find the `s`, then the `r`, then the `c`..., and then going back to find the `<` and checking whether that `<` is inside `""` or `''`..., I would like to know how you would do it. Can anyone show me an example? I can't easily get a substring like `` with a regexp. Thanks. – qidizi Feb 06 '12 at 07:22
  • Another way is to set `innerHTML` and then call `getElementsByTagName('*')`, use a `for` loop to determine which elements are img elements, and then get their src. – qidizi Feb 06 '12 at 07:25
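
The DOM approach from the last comment could look roughly like the sketch below. It lets the browser do the parsing and asks for `img` elements directly rather than filtering `getElementsByTagName('*')`. `getImageFileNamesViaDom` and its `doc` parameter are names made up for this illustration; passing the document in is only a convenience so the function does not depend on a global.

```javascript
// Sketch: let the browser parse the markup, then read src from each img.
// getImageFileNamesViaDom and doc are hypothetical names for this example.
function getImageFileNamesViaDom(doc, html) {
    // A detached container: the markup is parsed but never shown on the page.
    var container = doc.createElement('div');
    container.innerHTML = html;

    var images = container.getElementsByTagName('img');
    var imageRe = /([^\/]+\.(?:gif|jpg|png))$/i; // last path segment with an image extension
    var names = [];
    for (var i = 0; i < images.length; i++) {
        // getAttribute returns the value as written in the markup, or null.
        var src = images[i].getAttribute('src') || '';
        var file = imageRe.exec(src);
        if (file) {
            names.push(file[1]);
        }
    }
    return names;
}
```

In the HTA from the question this might be called as `getImageFileNamesViaDom(document, getObj('content').value)`. How an incomplete tag such as `<img src="ss.gif" </b>` is treated then depends on the browser's error recovery, which may differ from what the original regex was meant to enforce.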