Given the following HTML content (limited to the absolute minimum I require):

[image of the HTML snippet]

How would I be able to extract Page Title using Regex?

Daan
  • Are you only grabbing titles or are you going to be parsing out more from the document? If so, use an HTML parser. – David B Sep 10 '12 at 16:00
  • You may look at [this](http://stackoverflow.com/a/1732454/148481) answer – Luca Martini Sep 10 '12 at 16:02
  • Wow :O Happened to have missed that. So should I use an HTML parser, and if so, which one? – Daan Sep 10 '12 at 16:07
  • It depends on what language you want to use. The main reason for an HTML parser is the malformed nature of HTML/XML. – David B Sep 10 '12 at 16:09
  • The language is C# (if that's what you mean). I still feel that an HTML parser is overkill in my situation. If we assume the pattern is always exactly like this, wouldn't I be better off using regex? – Daan Sep 10 '12 at 16:15

1 Answer

As others have commented, regular expressions are not a bullet-proof way to do this. For example, with a regex it would be difficult to tell whether a <title> tag is part of a quoted string within the HTML. That's a recurring response on Stack Overflow to questions like this. Personally, though, I think you have a point that a parser would be overkill for such a simple extraction. If you're looking for a method that works most of the time, one of the following should suffice.

Option 1: Lookbehind / lookahead

(?<=<title[\s\n]*>[\s\n]*)((?![\s\n]*</title[\s\n]*>).)*

This uses lookbehind and lookahead for the tags. .NET's regex engine is sophisticated enough to allow unbounded repetition inside a lookbehind, so you can even allow for whitespace/newline characters between the tag name and the closing angle bracket (see this answer).
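As a rough sketch of how the lookaround approach might be called from C#: the class name `TitleLookaround` and method `TryExtract` below are made up for illustration, and the `IgnoreCase` and `Singleline` options are my own assumptions (so that `<TITLE>` and titles spanning several lines are handled). The negative lookahead is tested before each character, so the last character of the title is kept, and `[\s\n]` is shortened to `\s` since `\s` already covers newlines.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class TitleLookaround
{
    // The lookbehind anchors the match just after the opening tag.
    // The negative lookahead runs before each character, so matching stops
    // ahead of the closing tag and any whitespace directly in front of it.
    private static readonly Regex Pattern = new Regex(
        @"(?<=<title\s*>\s*)((?!\s*</title\s*>).)*",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static bool TryExtract(string html, out string title)
    {
        Match m = Pattern.Match(html);
        // The whole match is the title text; Trim() removes any leading
        // whitespace that the match picked up after the opening tag.
        title = m.Success ? m.Value.Trim() : string.Empty;
        return m.Success;
    }
}
```

Calling `TitleLookaround.TryExtract("<title>Page Title</title>", out var title)` should return `true` and set `title` to `Page Title`. Variable-length lookbehind is a .NET feature, so this pattern will not port unchanged to most other regex engines.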

Option 2: Capturing group

<title[\s\n]*>[\s\n]*(.*)[\s\n]*</title[\s\n]*>

Similar but slightly simpler - the whole regex match includes the start and end tags. The first (and only) capturing group (.*) captures the bit that is of interest in between.
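A minimal sketch of the capturing-group approach in C#, under a few assumptions of my own: the names `TitleCapture` and `TryExtract` are hypothetical, the group is made lazy (`.*?`) so it stops at the first closing tag and leaves trailing whitespace out, and `IgnoreCase` plus `Singleline` are added so that upper-case tags and multi-line titles are handled.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class TitleCapture
{
    // The full match includes both tags; group 1 holds the text in between.
    private static readonly Regex Pattern = new Regex(
        @"<title\s*>\s*(.*?)\s*</title\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static bool TryExtract(string html, out string title)
    {
        Match m = Pattern.Match(html);
        // Read the capturing group rather than the whole match.
        title = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

For the input `<title>Page Title</title>`, `TitleCapture.TryExtract` should return `true` with `title` set to `Page Title`. Because this version avoids lookbehind, it should be easier to carry over to regex engines that only support fixed-length lookbehind.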

Steve Chambers