Regex to match the first ending HTML tag

Question

I am trying to write a regex that matches up to the first closing form tag.

  <form.*name="loginForm".*>[^~]*</form>

The regex above matches up to the end of the second form, i.e. up to line 8, but I want a regex that stops at the nearest closing form tag. In the example below, it should match up to line 5.

<html>
<body>
<form method = "post" name="loginForm" >
<input type="text" name="userName"/>
</form>
<form method = "post" name="signupForm" >
<input type="text" name="userName"/>
</form>
</body>
</html>

Thanks for the quick response and suggestion. I found my answer. Special thanks to Guffa. — , Sep 22 '09 at 14:57

Guffa · Answer 1 · 2009-09-22T07:21:49.347

11

Just make the pattern non-greedy, so that it matches the smallest possible number of characters instead of the largest possible:

<form[^>]*name="loginForm"[^>]*>[^~]*?</form>

Edit:
Changed .* to [^>]* in the form tag, so that it doesn't match outside the tag.
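
As an illustration of the difference (a minimal Python sketch, not part of the original answer; the variable names are illustrative), the greedy and non-greedy patterns can be run against the sample page from the question. Note that [^~] matches any character except a tilde, newlines included, so this assumes the page contains no "~".

```python
import re

# Sample page from the question.
html = """<html>
<body>
<form method = "post" name="loginForm" >
<input type="text" name="userName"/>
</form>
<form method = "post" name="signupForm" >
<input type="text" name="userName"/>
</form>
</body>
</html>"""

# Greedy body: [^~]* grabs as much as possible, then backtracks to the LAST </form>.
greedy = re.compile(r'<form[^>]*name="loginForm"[^>]*>[^~]*</form>')
# Non-greedy body: [^~]*? grows one character at a time and stops at the FIRST </form>.
lazy = re.compile(r'<form[^>]*name="loginForm"[^>]*>[^~]*?</form>')

greedy_match = greedy.search(html).group(0)
lazy_match = lazy.search(html).group(0)

print(greedy_match.count("</form>"))  # number of closing form tags swallowed by the greedy match
print(lazy_match.count("</form>"))    # number of closing form tags in the non-greedy match
print(lazy_match)
```

In Python, .*? together with the re.DOTALL flag would be a more conventional way to write the body than [^~]*?, since it does not depend on the page being free of tildes.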

edited Sep 22 '09 at 07:21

answered Sep 22 '09 at 06:00

Guffa


4

Will fail for nested tags. Not sure that forms will ever be nested, but using a regex to parse HTML is still a bad idea, even if it works in some select cases. – Chris Lutz Sep 22 '09 at 06:07
@Guffa: Then you should make all quantifiers non-greedy. – Gumbo Sep 22 '09 at 06:15
The question didn't ask for the matching form tag, it asked for the first, which might not be the best thing to do, but this is a valid and useful trick sometimes. – Kevin Peterson Sep 22 '09 at 06:17
@Chris: Form tags can't be nested. – Guffa Sep 22 '09 at 07:01
@Kevin: The first ending tag is the matching tag. – Guffa Sep 22 '09 at 07:04
@Gumbo: Good point, however they don't have to be non-greedy, they just have to be kept inside the tag. – Guffa Sep 22 '09 at 07:22
@Gumbo: Actually, if the expression before the name attribute were merely non-greedy, it would fail whenever the loginForm is not the first form in the code... – Guffa Sep 22 '09 at 07:37
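
The point made in the last two comments can be checked with a small Python sketch. The single-line markup and the variable names below are made up for illustration; the markup is kept on one line because "." does not match a newline by default in Python.

```python
import re

# Hypothetical markup in which loginForm is NOT the first form.
html = ('<form name="signupForm"><input name="email"/></form>'
        '<form name="loginForm"><input name="userName"/></form>')

# Non-greedy ".*?" may still run past the end of the first tag in order to
# find name="loginForm", so the match starts at the wrong form.
lazy_dot = re.compile(r'<form.*?name="loginForm".*?>[^~]*?</form>')
# "[^>]*" cannot cross a ">", so the attribute search stays inside one tag.
bounded = re.compile(r'<form[^>]*name="loginForm"[^>]*>[^~]*?</form>')

dot_match = lazy_dot.search(html).group(0)
bounded_match = bounded.search(html).group(0)

print("signupForm" in dot_match)      # does the lazy-dot match include the wrong form?
print("signupForm" in bounded_match)  # does the bounded match include it?
```

With the lazy-dot pattern, the engine begins at the first form tag and simply extends until it reaches the name attribute of the second one, which is why keeping the quantifiers inside the tag matters more than making them non-greedy.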

score 2 · Answer 2 · answered Sep 22 '09 at 05:40

2

Use a real parser like DOMDocument, SimpleXML or SimpleHTMLDOM. Regular expressions are not suitable for parsing non-regular languages like HTML.
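
The libraries named above are PHP ones. As a rough Python analogue (an assumption, not something this answer mentions), the standard library's html.parser can walk the tags without any regex. The class name FormInputCollector is hypothetical, and the sketch assumes forms are not nested:

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect the names of <input> elements inside one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.inside = False      # True while we are inside the wanted form
        self.input_names = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form" and attributes.get("name") == self.form_name:
            self.inside = True
        elif tag == "input" and self.inside:
            self.input_names.append(attributes.get("name"))

    def handle_endtag(self, tag):
        # The first closing form tag ends the form we are tracking.
        if tag == "form":
            self.inside = False


html = """<html>
<body>
<form method = "post" name="loginForm" >
<input type="text" name="userName"/>
</form>
<form method = "post" name="signupForm" >
<input type="text" name="userName"/>
</form>
</body>
</html>"""

collector = FormInputCollector("loginForm")
collector.feed(html)
print(collector.input_names)
```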

answered Sep 22 '09 at 05:40

Gumbo


3

+1 Was going to say it, but I had to make the question presentable first. Let's link to the explanation of why (http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege) and the example of parsers (http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser) for completeness. – Chris Lutz Sep 22 '09 at 05:42
1

A regular expression works just fine for parsing a string like this. There is definitely no need for it to be a regular language to be parsed by a regular expression. – Guffa Sep 22 '09 at 09:53

score 2 · Answer 3 · answered Sep 22 '09 at 05:41

You should NOT use regular expressions, but parse it with DOM:

JavaScript:

var forms = document.getElementsByTagName('form');
var first = forms[0]; // the first form element

PHP:

$dom = new DOMDocument();
$dom->loadHTML( $html );
$forms = $dom->getElementsByTagName('form');
$first = $forms->item(0); // reference to first form

You can use minidom and ElementTree for Python.
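
A minimal ElementTree sketch for the page in the question might look like the following. It assumes the markup is well-formed XML, which this sample happens to be but real-world HTML often is not; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample page from the question; it parses as XML because every tag is closed.
html = """<html>
<body>
<form method = "post" name="loginForm" >
<input type="text" name="userName"/>
</form>
<form method = "post" name="signupForm" >
<input type="text" name="userName"/>
</form>
</body>
</html>"""

root = ET.fromstring(html)

# Select the form by its name attribute instead of by position.
login_form = root.find(".//form[@name='loginForm']")

# Names of the inputs that are direct children of that form.
input_names = [el.get("name") for el in login_form.findall("input")]

print(login_form.get("method"))
print(input_names)
```

Selecting by attribute rather than taking the first form means the lookup keeps working if the forms are reordered.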
