I'm trying to extract the text within the title elements and ignore everything else.

I've looked at these questions, but they don't seem to help :\
  • Regular expression to extract text between square brackets
  • String Pattern Matching In Java
  • Java Regex to get the text from HTML anchor (<a>...</a>) tags

The main problem is I am not able to understand what the responders are saying while trying to hack up my own code.

Here is what I've managed from reading the Java API in the Pattern article.

<title>(.*?)</title>

Here's my code to return the title.

String title = null;
Matcher match = Pattern.compile("[<title>](.*?)[</title>]").matcher(this.webPage);
try{
    title = match.group();
}
catch(IllegalStateException e)
{
    e.printStackTrace();
}

I am getting the IllegalStateException, which says this:

java.lang.IllegalStateException: No match found
    at java.util.regex.Matcher.group(Matcher.java:485)
    at java.util.regex.Matcher.group(Matcher.java:445)
    at BrowserModal.getWebPageTitle(BrowserModal.java:21)
    at BrowserTest.main(BrowserTest.java:7)

Line 21 would be "title = match.group();"

ryandawkins
  • Please refrain from parsing HTML with RegEx. [Just trust us on this one](http://stackoverflow.com/a/1732454). Try an HTML or XML parser instead. – Matt Ball Feb 28 '13 at 05:26
  • Matt is right. Regular expressions are not the correct tool for the job. To give just one example of what's wrong with it, consider the possibility of comments: `List of <!--current -->products` – VGR Mar 01 '13 at 12:47

3 Answers

The question "What are the pros and cons of the leading Java HTML parsers?" lists a number of HTML parsers. Parse your HTML to a DOM, then use getElementsByTagName("title") to get the title elements, and grab the text content by looking at each element's children, which should be text nodes.
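As a rough sketch of the DOM approach, here is a version that uses only the XML parser bundled with the JDK. It assumes the page is well-formed XHTML, which real-world HTML often is not; for arbitrary HTML you would swap in one of the lenient parsers from the linked question. The class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes the input is well-formed XHTML.
class DomTitleExtractor {

    /** Returns the text of the first <title> element, or null if there is none. */
    static String extractTitle(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // Look elements up by tag name, not by class name.
            NodeList titles = doc.getElementsByTagName("title");
            if (titles.getLength() == 0) {
                return null;
            }
            // getTextContent() concatenates the text nodes below the element.
            return titles.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<!-- <title>Old commented out title</title> -->"
                + "<title lang=\"en\">Spiffy new title</title>"
                + "</head><body></body></html>";
        System.out.println(extractTitle(page)); // prints "Spiffy new title"
    }
}
```

Because the parser treats the commented-out markup as a comment node, only the real title element is found.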


title = match.group();

This is failing because group() returns the entire matched text. group(1) will return just the content of the first parenthetical group.
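To make the difference concrete, here is a small sketch (class and method names are invented for illustration). Note that Matcher.group() throws IllegalStateException until a match operation such as find() or matches() has succeeded, so the sketch calls find() first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the asker's code.
class GroupDemo {
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    /** Returns { group(), group(1) } for the first match, or null if nothing matches. */
    static String[] groups(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    /** Shows that group() throws when no match has been attempted yet. */
    static boolean throwsBeforeFind(String html) {
        Matcher m = TITLE.matcher(html);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String[] g = groups("<title>Foo</title>");
        System.out.println(g[0]); // prints "<title>Foo</title>"
        System.out.println(g[1]); // prints "Foo"
        System.out.println(throwsBeforeFind("<title>Foo</title>")); // prints "true"
    }
}
```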


[<title>](.*?)[</title>]

The square brackets are just breaking it. [<title>] will match any single character that is an angle bracket or a letter in the word "title".
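A quick way to see what the brackets do is to run both patterns against the same input. This is only an illustrative sketch; the helper name is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the bracketed and unbracketed patterns.
class CharClassDemo {

    /** Returns the whole text of the first match, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<title>Foo</title>";
        // Each bracketed part matches ONE character from its set, and the lazy
        // group in between matches nothing, so the match is just two characters.
        System.out.println(firstMatch("[<title>](.*?)[</title>]", html)); // prints "<t"
        // Without the brackets the tags are matched literally.
        System.out.println(firstMatch("<title>(.*?)</title>", html)); // prints "<title>Foo</title>"
    }
}
```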

<title>(.*?)</title>

is better, but will only match a title that is on one line (since . does not, by default, match newlines), and will not match minor variations like

<title lang=en>Foo</title>

It will also fail to find the title correctly in HTML like

<html>
<head>
<!-- <title>Old commented out title</title> -->
<title>Spiffy new title</title>
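The three failure cases above can be reproduced with a short sketch like this one (names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of where the naive pattern goes wrong.
class NaiveRegexPitfalls {

    /** Applies <title>(.*?)</title> and returns group 1, or null if nothing matches. */
    static String naiveTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Title spanning two lines: '.' stops at the newline, so there is no match.
        System.out.println(naiveTitle("<title>Two\nlines</title>")); // prints "null"

        // Attribute on the tag: the literal "<title>" no longer matches.
        System.out.println(naiveTitle("<title lang=en>Foo</title>")); // prints "null"

        // Commented-out title: the regex cannot tell comments from markup.
        String page = "<html>\n<head>\n"
                + "<!-- <title>Old commented out title</title> -->\n"
                + "<title>Spiffy new title</title>";
        System.out.println(naiveTitle(page)); // prints "Old commented out title"
    }
}
```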
Mike Samuel
  • +1, with a nit-pick. The `group()` call is failing in the sense of *throwing an exception* because the regex is never applied by calling `find()` (as demonstrated in [R.J's answer](http://stackoverflow.com/a/15128085/20938)). But once that's fixed it will fail in the sense of *returning the wrong result*, as you said. – Alan Moore Feb 28 '13 at 07:57

Try this:-

        String title = null;
        String subjectString = "<title>TextWithinTags</title>";
        Pattern titleFinder = Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher regexMatcher = titleFinder.matcher(subjectString);
        while (regexMatcher.find()) {
            title = regexMatcher.group(1);
        }

Edit:- Regex explained:-

[^>]* :- Anything but > is acceptable there. This is used as we can have attributes in the tags.

(.*?) :- By default the dot matches any character except a line terminator; with Pattern.DOTALL, as used in the code above, it matches line terminators as well. *? means repeat any number of times, but as few as possible.
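Here is the same pattern wrapped in a small self-contained class, as a sketch of how the two flags change what gets matched. The class and method names are invented for the example, and it returns the first title rather than the last.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from this answer.
class TitleFinderDemo {
    private static final Pattern TITLE_FINDER = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the content of the first title element, or null if none is found. */
    static String findTitle(String html) {
        Matcher m = TITLE_FINDER.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Upper-case tag, an attribute, and a line break inside the title.
        String title = findTitle("<TITLE lang=en>Multi\nline</TITLE>");
        System.out.println("Multi\nline".equals(title)); // prints "true"
    }
}
```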

For more details on regex, check this out.

Rahul

This gets the title in just one line of Java code:

String title = html.replaceAll("(?s).*<title>(.*)</title>.*", "$1");

This regex assumes the HTML is "simple", and with the "DOTALL" switch (?s) (which means dots also match new-line chars), it will work with multi-line input, and even multi-line titles.
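One thing to keep in mind with this approach: when the regex does not match, replaceAll returns the input unchanged, so the caller gets the whole document back rather than an empty result. A small sketch (class and method names are made up):

```java
// Hypothetical demo of the replaceAll one-liner and its no-match behaviour.
class ReplaceAllTitleDemo {

    /** Returns the title, or the unchanged input when no title element matches. */
    static String titleOf(String html) {
        return html.replaceAll("(?s).*<title>(.*)</title>.*", "$1");
    }

    public static void main(String[] args) {
        String page = "<html>\n<head><title>My\nPage</title></head>\n</html>";
        System.out.println("My\nPage".equals(titleOf(page))); // prints "true"

        // No <title> at all: the input comes back as-is.
        System.out.println(titleOf("<p>none</p>")); // prints "<p>none</p>"
    }
}
```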

Bohemian
  • Actually, `(?s)` activates DOTALL mode (also known as single-line mode). Maybe you're thinking of Ruby? But it uses `(?m)` for that, not `(?s)`. – Alan Moore Feb 28 '13 at 08:09
  • @AlanMoore no, I'm thinking of java. Without that switch the regex wouldn't match text that spans across a line feed. – Bohemian Feb 28 '13 at 08:25
  • That's true. Trouble is, it will match across *all* of the linefeeds. Each instance of `.*` can and will gobble up the whole remaining document at first, only to have to backtrack almost to the beginning again. – Alan Moore Feb 28 '13 at 08:58
  • @AlanMoore what backtracking/problem? The performance of this would be measured in microseconds. This code would work just fine. – Bohemian Feb 28 '13 at 11:10
  • When I apply your regex to the source code of this page, it takes three or four seconds to complete the match. RegexBuddy reports that it takes over 277,000 steps. And that's for a *successful* match. If I remove the `/` from the closing tag in the source (to simulate the kind of sloppy HTML we see so much of), it takes almost *ten* seconds and over 443,000 steps to report failure. It probably won't matter in most cases, but you should be aware that the convenience of `.*` comes at a price, especially when you use it in DOTALL/singleline mode. – Alan Moore Feb 28 '13 at 23:28
  • @AlanMoore When I test the source for this page (about 60K) using this code (copy-pasted), it takes 38 milliseconds to (successfully) execute (Eclipse Juno). Clearly there's nothing "bad" about this code - there must be some kind of problem with your environment. – Bohemian Mar 03 '13 at 18:36