
I am trying to extract the page title from HTML and XML pages. This is the regular expression I use:

Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*");

The problem is that it only extracts the title from HTML files and gives me null for XML files. Can anyone help me change the regex to get the XML page titles as well?

Code:

content = stringBuilder.toString(); // the content of the file as a string
Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*");
Matcher m = p.matcher(content);
while (m.find()) {
    title = m.group(1);
}
Alan Moore
Lucy
  • Have you considered [*not* using a regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)? – Jon Skeet Mar 28 '12 at 17:27
  • This sort of question is common, and the answer is the same: regex isn't suitable for parsing HTML. That being said, for something very tactical like this, you might be successful. Post your code and we'll look at it. – Tony Ennis Mar 28 '12 at 17:35
  • content = stringBuilder.toString(); /* put content of the file as a string */ Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*"); Matcher m = p.matcher(content); while (m.find()) { title = m.group(1); } – Lucy Mar 28 '12 at 17:56
  • What is the structure the title comes in for XML? There's no need for an XML file to obey the head-title structure that HTML uses. – GetSet Mar 28 '12 at 18:22
  • possible duplicate of [Extracting Information from websites](http://stackoverflow.com/questions/318564/extracting-information-from-websites) – outis Mar 28 '12 at 20:42

2 Answers


As said above, regular expressions are not suited to parsing XML and HTML. However, in some cases they come in handy, so here is something that should work:

Pattern p = Pattern.compile("<head>.*?<title>(.*?)</title>.*?</head>", Pattern.DOTALL); 
Matcher m = p.matcher(content);
while (m.find()) {
    title = m.group(1);
}

If you use Matcher.find(), there is no need to put .* before and after the pattern, since find() looks for a match anywhere in the input and those parts are not in any group. You may also want to use reluctant quantifiers (i.e., *? instead of *, +? instead of +, etc.) so that each part matches as little as possible. Finally, you should also use the Pattern.DOTALL flag; otherwise the dot does not match line terminator characters.
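Note that the pattern above still requires a <head> element, which an XML file may not have. Here is a minimal sketch of a variant that drops that requirement. It assumes the XML files carry their title in a plain <title> element, which the question does not confirm; the class name TitleExtractor and the method extractTitle are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleExtractor {
    // DOTALL lets '.' cross line breaks; CASE_INSENSITIVE also accepts <TITLE>.
    // No <head> wrapper is required (assumption: XML files use a bare <title> element).
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed text of the first title element, or null if none is found. */
    static String extractTitle(String content) {
        Matcher m = TITLE.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head>\n<title>My Page</title>\n</head><body></body></html>";
        String xml = "<?xml version=\"1.0\"?>\n<doc>\n  <title>\n    My Feed\n  </title>\n</doc>";
        System.out.println(extractTitle(html)); // My Page
        System.out.println(extractTitle(xml));  // My Feed
    }
}
```

This is still a regex, so it can be fooled by a <title> inside a comment or CDATA section; for anything beyond a quick extraction a real parser is the safer choice.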

Guillaume Polet

Regular expressions for this? What about the following, which uses plain string searching (for example, to pull out the body portion)?

StringBuilder sb = new StringBuilder();
sb.append(html, html.indexOf("<body>") + 6, html.lastIndexOf("</body>"));
String headless = sb.toString();
System.out.println(headless);
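
The same idea can be applied to the title itself. This is only a sketch: it assumes the tag is written exactly as <title> (lowercase, no attributes), and the names IndexOfTitle and titleOf are invented for the example. Unlike the snippet above, it checks for a missing tag before taking the substring.

```java
final class IndexOfTitle {
    /** Returns the text between the first <title> and the next </title>, or null if either is missing. */
    static String titleOf(String content) {
        String open = "<title>";
        int start = content.indexOf(open);
        if (start < 0) {
            return null; // no opening tag
        }
        start += open.length(); // skip past the opening tag itself
        int end = content.indexOf("</title>", start);
        if (end < 0) {
            return null; // no closing tag after the opening one
        }
        return content.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String xml = "<doc>\n  <title>My Feed</title>\n</doc>";
        System.out.println(titleOf(xml)); // My Feed
    }
}
```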
Mitja Gustin