Regular Expressions to match an <a> tag

Question

I am writing a small Java program for a class, and I can't quite figure out why my regex isn't working properly. In the special case of two <a> tags on the same line of input, it only matches the second one.

Here is a link that has the regex included, along with a simple set of test data: Regex Test Link.

In my Java program I have the following code:

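// Needs java.util.regex.Pattern and java.util.regex.Matcher.
// regex, line and input (a reader with readLine()) are declared elsewhere.
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String[] results;
System.out.println(p.toString());

while ((line = input.readLine()) != null) {
    Matcher m = p.matcher(line);
    // find() moves to the next match on this line and returns false when none remain
    while (m.find()) {
        System.out.println("Matches: " + m.group(1));
    }
}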

The goal is to extract the href value, as long as it starts with http:// and the URL either names no page (like http://www.google.com) or ends in index.htm or index.html (like http://www.google.com/index.html).

My regex works for every case above, but doesn't match in the special case of two tags on the same line.

Any help is appreciated.

Please see this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — bdonlan, Nov 05 '11 at 04:37
Also: http://stackoverflow.com/questions/238036/java-html-parsing — bdonlan, Nov 05 '11 at 04:38
The actual regex can be seen on the test page that is linked above the code. It was easier to show it that way rather than paste it in. Plus it allows you to see what is working and may make it easier to edit. — Eric Reynolds, Nov 05 '11 at 04:47
Related: [How to extract links from HTML?](http://stackoverflow.com/questions/3394298/full-link-extraction-using-java) — BalusC, Nov 05 '11 at 20:28

score 1 · Answer 1 · answered Nov 05 '11 at 05:58

Just use a proper HTML parsing library, such as HtmlCleaner. It is theoretically impossible to properly parse HTML with a regex; there are too many constructs that will confound it. For example:

<![CDATA[ > <a href="http://foo.com">bar</a> ]]>

This is not a link. This is literal text in XHTML.

<a href="http://bar.com/?<a href=http://foo.com>bar</a>">baz</a>

This is only one link.

<a rel="next" href="bar?2">Next</a>

This is a realistic example of a link with a relation attribute and a relative URI.

<a name="foo">The href="http://example.com" part is the link destination...</a>

This is a named anchor, not a link. However your regex would parse out the literal text here as a link.

<a
href="http://example.com">Foo</a>

Does your regex handle line-spanning links properly?
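To make the line-spanning case concrete, here is a minimal sketch. The pattern and the class name are hypothetical and are not the asker's regex; the point is only that a matcher fed one line at a time never sees the whole tag, while the same pattern applied to the full text does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineSpanningLink {
    // Hypothetical pattern for illustration only: "<a", one whitespace
    // character, optional other attributes, then an http:// href.
    static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href=\"(http://[^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Collects group 1 (the href value) of every match in the given text.
    static List<String> hrefs(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a\nhref=\"http://example.com\">Foo</a>";

        // Matching line by line: neither line contains a complete opening tag.
        List<String> perLine = new ArrayList<>();
        for (String line : html.split("\n")) {
            perLine.addAll(hrefs(line));
        }
        System.out.println(perLine);      // prints []

        // Matching the whole text: \s also matches the newline.
        System.out.println(hrefs(html));  // prints [http://example.com]
    }
}
```

If the input is guaranteed to keep each tag on one line, as in a tightly specified assignment, the line-by-line loop is fine; otherwise the text has to be joined before matching.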

There are all kinds of other fun edge cases that can occur. Save yourself time and headaches. These problems have already been solved and wrapped up in nice neat libraries for you to use. Take advantage of this.

Regexes may be a powerful tool, but as they say - when all you have is a hammer, everything looks like a nail. You are currently trying to hammer in a screw.

I understand your points and they are all valid. But this is for a class where I was told to specifically use regex to parse, and I was given very specific qualifications of how the links will look. Also, specifically stated that I cannot use outside libraries. — Eric Reynolds, Nov 05 '11 at 06:13
Regex are a good "quick and dirty" solution to simple non-nested tags. He's not trying to parse every possible variation of a syntactically valid href tag, but rather a small subset. No need to find a multipurpose electric screwdriver that fits every possible screw ever to exist for a single screw. — aleph_null, Nov 05 '11 at 13:15
@EricReynolds, ah, if you've been told not to use external libraries that's different, I guess... — bdonlan, Nov 05 '11 at 15:11

score 0 · Answer 2 · answered Nov 05 '11 at 04:40

This worked for me in that regex tester page

<a[^>]*>[^<]*</a>
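As a rough check of the pattern above, the following sketch runs it through a find() loop on a line holding two tags. The sample line and the class name are made up for illustration. Because [^>]* and [^<]* cannot run past the end of a tag, each tag comes back as its own match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatchDemo {
    // The pattern from this answer, compiled case-insensitively as in the question.
    static final Pattern TAG = Pattern.compile("<a[^>]*>[^<]*</a>", Pattern.CASE_INSENSITIVE);

    // Returns the full text of every match found in the line.
    static List<String> matches(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample line with two tags.
        String line = "<a href=\"http://one.com\">1</a> <a href=\"http://two.com\">2</a>";
        for (String match : matches(line)) {
            System.out.println(match);
        }
        // prints:
        // <a href="http://one.com">1</a>
        // <a href="http://two.com">2</a>
    }
}
```

This pattern has no capture group, so it returns whole tags rather than the href value.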

Answered by aleph_null

P.S. Jesus people, there's nothing wrong with using regex for simple and non-nested tags. Using a full blown parser would be complete overkill for these cases. – aleph_null Nov 05 '11 at 04:42
Yes that works, but I need to pull out the link in the href, not just ignore it. Therefore I need to group the href to pull it out. – Eric Reynolds Nov 05 '11 at 04:44
Something like this? [^<]* You may need to extract more fields, but you can tweak the regex as necessary. – aleph_null Nov 05 '11 at 04:55
The thing is, my regex works. I can run it on links that are on individual lines and my java program grabs the href information without any issue. It is just when 2 of them are on the same line that is read in. In that instance it only finds the second one and not the first one (even though both are valid for the match). So I think this is more of a Java based usage question, or just an oversight in my regex. In the test site though, both href's are identified. – Eric Reynolds Nov 05 '11 at 04:59
The regex that's loaded into the site you linked to matches everything from the start tag of the first to the end tag of the second ... It's a regex issue, not java. – aleph_null Nov 05 '11 at 05:19
It's not an "issue", it's the correct use of quantifiers in a regex, see http://download.oracle.com/javase/tutorial/essential/regex/quant.html – mazaneicha Nov 05 '11 at 05:24
The regex posted in the link above matches everything from the opening tag of the first href to the closing tag of the second href. Here's why: " – aleph_null Nov 05 '11 at 05:26
@mazaneicha incorrectly using quantifiers in a regex is an issue. – aleph_null Nov 05 '11 at 05:28
@aleph_null hahaha agreed. Incorrectly using anything is an issue of the user, not the thing itself. – mazaneicha Nov 05 '11 at 05:50
@aleph_null, a tags can be quoted using 's or even not quoted at all, you know... They can also contain embedded >s. a simplistic regex like that is not sufficient. No regex is sufficient. Just use a proper HTML parsing library, it's not like Java has any shortage of them. – bdonlan Nov 05 '11 at 05:53
I have solved the issue. I would post the solution myself but I have to wait 8 hours because of my rep points. I rewrote the regex. One main issue was my first .* was greedy causing everything to match. Once I made a few corrections then made that .* lazy, it matched twice and worked perfectly. Once the time is up I will post the solution and add more detail. Thanks for your help aleph_null – Eric Reynolds Nov 05 '11 at 05:54
I guess for those that are interested, here's the regex solution: [\w]* – Eric Reynolds Nov 05 '11 at 05:59
@EricReynolds, pretty sure this fails on some of the examples I have below. Why roll your own when these problems have been solved, many times, already? :/ – bdonlan Nov 05 '11 at 06:04
@EricReynolds, I'll also note that not all webpages end in .htm or .html, and there are lots of characters allowed in URLs that your regex doesn't include - `-`, for example. – bdonlan Nov 05 '11 at 06:06
@bdonlan, Again ... I have very specific specifications for the assignment. The goal isn't to parse every possible known URL. I have said this like 3x and said it in my original description... – Eric Reynolds Nov 05 '11 at 10:19

score 0 · Accepted Answer · answered Nov 05 '11 at 20:24

Regex Solution

So I was playing around and realized my issue. I adjusted my regex a bit. My main problem was that the .* at the beginning was greedy, so it matched everything up to the last tag, and the regex therefore matched only once instead of twice. I made that .* lazy and it matched twice instead of once. That was the only issue. Once the corrected regex was added to my Java program, my loop code worked fine.
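The actual regex is not reproduced in this thread, so the following is only a sketch of the greedy versus lazy behaviour, using two hypothetical patterns that differ in a single ?. The class name, patterns and sample line are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical patterns for illustration; not the regex from the question.
    // Greedy: .* first runs to the end of the line, then backtracks to the LAST href.
    static final String GREEDY = "<a .*href=\"(http://[^\"]*)\"";
    // Lazy: .*? expands only as far as needed, so it stops at the FIRST href.
    static final String LAZY = "<a .*?href=\"(http://[^\"]*)\"";

    // Runs the same find() loop as the question and collects group 1.
    static List<String> hrefs(String regex, String line) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(line);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://one.com\">1</a> <a href=\"http://two.com\">2</a>";
        System.out.println(hrefs(GREEDY, line)); // prints [http://two.com]
        System.out.println(hrefs(LAZY, line));   // prints [http://one.com, http://two.com]
    }
}
```

With the greedy form, the single match starts at the first tag and swallows everything up to the second href, which is why the loop reports only the second link. A lazy .*? can still cross a tag boundary when the first tag has no href, so a negated class such as [^>]* is a tighter choice where the input allows it.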

Thanks everyone that responded. While you may not have provided the answer, your comments got me thinking in the right direction!

score -1 · Answer 4 · answered Nov 05 '11 at 04:34

You would have to look through all the matches you got per line and find which one looks like a URL (like with some more regex ;))

Answered by DanZimm

Regular expressions are incapable of correctly parsing (X)HTML. Don't even try. There are much more effective libraries out there. – bdonlan Nov 05 '11 at 04:39
That should be done in the while(m.find()) loop. My understanding of the Matcher classes find() method is that it moves through each match until it does not match anymore (and returns false). – Eric Reynolds Nov 05 '11 at 04:39
