0

I have the following regular expression:

(?:<(?<tag>\w*)>(?<text>.*)</\k<tag>>)

I want it to grab the text within the first HTML element.

e.g.

<p>This should capture</p>This shouldn't

Works, but ...

<p>This should capture</p><p>This shouldn't</p>

Doesn't work. As you'd expect, it returns:

This should capture</p><p>This shouldn't

I'm racking my brains here. How can I just have it select the FIRST inner text?

(I'm trying to be tag-agnostic, so <strong>This should match</strong> is equally appropriate, etc.)

Program.X
  • 7
    **DO NOT PARSE HTML USING Regular Expressions!** – SLaks Jun 03 '10 at 15:41
  • And there was me thinking I should do that instead of building a state machine. Any reason why? – Program.X Jun 03 '10 at 15:42
  • 2
    See this answer to a similar question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Sarah Vessels Jun 03 '10 at 15:43
  • 1
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – SLaks Jun 03 '10 at 15:43
  • Fair enough. I know what you're saying, I just wanted the first bit of text so I didn't anticipate parsing an extensive document. Thanks. – Program.X Jun 03 '10 at 15:46
  • 2
    @Program: And what do you expect to do in the case of `

    text

    more text

    `? You can't keep track of arbitrary nesting like this using RegEx.
    – BlueRaja - Danny Pflughoeft Jun 03 '10 at 15:52

3 Answers

3

You should use the HTML Agility Pack.

For example:

doc.DocumentNode.Descendants("p").First().InnerText
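
For readers without .NET at hand, the same idea can be sketched with Python's standard-library html.parser. This is a stand-in for illustration, not the HTML Agility Pack API; the names FirstElementText and first_inner_text are hypothetical. Unlike the one-liner above, which asks for the first p element, this sketch takes whatever element comes first, to match the tag-agnostic requirement in the question.

```python
from html.parser import HTMLParser


class FirstElementText(HTMLParser):
    """Collect the text inside the first element seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tag = None    # name of the first element encountered
        self.depth = 0     # open same-named tags, so nesting is tracked
        self.done = False  # set once the first element has closed
        self.parts = []    # text chunks collected so far

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.tag is None:
            self.tag, self.depth = tag, 1
        elif tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or tag != self.tag:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        # Keep text only while we are inside the first element.
        if self.tag is not None and not self.done:
            self.parts.append(data)


def first_inner_text(html):
    parser = FirstElementText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(first_inner_text("<p>This should capture</p><p>This shouldn't</p>"))
# prints: This should capture
```

Because the parser counts open tags of the same name, nested input such as `<div>a<div>b</div>c</div>d` yields `abc` rather than stopping at the first closing tag.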
SLaks
  • 1
    I actually am looking at using the HtmlAgilityPack for another section of the project so this is "on the radar". I might just use it in the longer term. – Program.X Jun 03 '10 at 15:52
2

Stop. Just stop. If you are parsing HTML, use an HTML parser (or XML if you're dealing with valid XHTML). See this answer for more info.
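
As a rough illustration of the XML route, here is a minimal sketch using Python's xml.etree.ElementTree. It assumes the input is well-formed XHTML with a single root element, so the question's two sibling paragraphs are wrapped in a div here; the function name first_child_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def first_child_text(xhtml):
    """Return all text inside the first child of the root element."""
    root = ET.fromstring(xhtml)  # raises ParseError if not well-formed
    first = root[0]              # raises IndexError if the root is empty
    # itertext() walks nested elements too, but skips the text that
    # follows the element's own closing tag.
    return "".join(first.itertext())


doc = "<div><p>This should capture</p><p>This shouldn't</p></div>"
print(first_child_text(doc))
# prints: This should capture
```

An XML parser rejects markup that is not well-formed, so this only suits input you know to be valid XHTML; for real-world HTML a lenient HTML parser is the safer choice.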

Community
Hank Gay
1

To make the * quantifier non-greedy (lazy), add a ? after it, so that it matches as few characters as possible.

(?:<(?<tag>\w*)>(?<text>.*?)</\k<tag>>)
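
To see the difference, here is a small sketch in Python. Note the assumption: Python's re module spells named groups as (?P<name>...) and backreferences as (?P=name), so the patterns below are translations of the .NET-style syntax used in this thread, not the exact strings. The last case shows the limitation raised in the comments: with nested tags of the same name, the lazy version stops at the first closing tag.

```python
import re

# Python equivalents of the .NET-style (?<tag>...) and \k<tag>.
GREEDY = re.compile(r"<(?P<tag>\w*)>(?P<text>.*)</(?P=tag)>")
LAZY = re.compile(r"<(?P<tag>\w*)>(?P<text>.*?)</(?P=tag)>")

sample = "<p>This should capture</p><p>This shouldn't</p>"

# Greedy .* runs to the LAST matching closing tag.
greedy_text = GREEDY.search(sample).group("text")
print(greedy_text)  # prints: This should capture</p><p>This shouldn't

# Lazy .*? stops at the FIRST matching closing tag.
lazy_text = LAZY.search(sample).group("text")
print(lazy_text)    # prints: This should capture

# Nested same-name tags defeat the lazy version too.
nested = "<div>outer <div>inner</div> tail</div>"
nested_text = LAZY.search(nested).group("text")
print(nested_text)  # prints: outer <div>inner
```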
HoLyVieR
  • Thanks. I'm going to go for that only because it is very simple work I am doing and I am coping with failure elegantly. Then again @BlueRaja has just blown a hole in my theory. Sorry. – Program.X Jun 03 '10 at 15:51