
I have an HTML file and I am trying to retrieve the inner text of each tag. I am using a regex with the following pattern:

(?<=>).*?(?=<)

It works fine for simple inner text, but I recently encountered the following HTML fragments:

<div id="mainDiv"> << Generate Report>> </div>
<input id="name" type="text">Your Name->></input>

I am not sure how to retrieve these inner texts with regular expressions. Can someone please help?

Thanks

K S
  • Please read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) and opt for the HTML Agility Pack – Andrei Feb 11 '14 at 17:49

3 Answers


I'd use a parser, but this is possible with RegEx using something like:

<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>

Then you can grab the inner text with capture group 2.
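
As a minimal sketch of reading that second group in C#, assuming the two fragments from the question are held in one string (the class name `InnerTextDemo` and the variable names are only illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

class InnerTextDemo
{
    static void Main()
    {
        // The two fragments from the question, one per line.
        string html =
            "<div id=\"mainDiv\"> << Generate Report>> </div>\n" +
            "<input id=\"name\" type=\"text\">Your Name->></input>";

        // Group 1 = tag name, group 2 = inner text.
        var tagPair = new Regex(@"<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>");

        foreach (Match m in tagPair.Matches(html))
        {
            // Groups[0] is the whole element; Groups[2] is only the inner text.
            Console.WriteLine("{0}: [{1}]", m.Groups[1].Value, m.Groups[2].Value);
        }
        // Prints:
        // div: [ << Generate Report>> ]
        // input: [Your Name->>]
    }
}
```

Because `.` does not match a newline by default, an element whose text spans several lines would additionally need `RegexOptions.Singleline`.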

tenub
  • Thanks for your response. This grabs the entire div block. Not just the inner text. Also, there is only one capture. :( – K S Feb 11 '14 at 18:11
  • `([a-zA-Z0-9]+)` is the first capture and `\/\1` is the backreference to it to match the closing tag. `(.+?)` is the second capture group. I said to retrieve the second capture group, not the entire match (capture group 0). – tenub Feb 11 '14 at 18:13

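That's exactly why you shouldn't use regex to parse HTML. You can, however, get around this particular problem by using a backreference in the regex. The `\b` is there because .NET evaluates a lookbehind from right to left, so without it `[^<>]*` would swallow most of the tag name:

(?<=<(\w+)\b[^<>]*>).*?(?=</\1>)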

That won't always work, though, because:

  • tags won't always have a closing tag
  • attribute values can contain < and >
  • there can be arbitrary whitespace around the tag name
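
A small sketch of how the backreference idea behaves in .NET, including the first limitation. The pattern is written here as `(?<=<(\w+)\b[^<>]*>).*?(?=</\1>)`; the class name and the `<br>` sample are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

class BackreferenceDemo
{
    static void Main()
    {
        // The lookbehind captures the tag name; the lookahead requires the matching close tag.
        var innerText = new Regex(@"(?<=<(\w+)\b[^<>]*>).*?(?=</\1>)");

        string[] samples =
        {
            "<div id=\"mainDiv\"> << Generate Report>> </div>",
            "<p>one<br>two</p>" // <br> has no closing tag
        };

        foreach (string sample in samples)
        {
            foreach (Match m in innerText.Matches(sample))
            {
                Console.WriteLine("[{0}]", m.Value);
            }
        }
        // Prints:
        // [ << Generate Report>> ]
        // [one<br>two]
    }
}
```

The second result still contains the `<br>` markup, because the pattern only understands matching open/close pairs.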

Use an HTML parser such as HtmlAgilityPack.

Your code would be as simple as this:

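using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// InnerText of every <div>; note that SelectNodes returns null when nothing matches
List<string> divs = doc.DocumentNode
                       .SelectNodes("//div")
                       .Select(x => x.InnerText)
                       .ToList();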
Anirudha

You can always strip out the HTML tags themselves: a single tag can be described by a regular grammar, even though HTML as a whole cannot. Replace "<[a-zA-Z][a-zA-Z0-9]*\s*([a-zA-Z]+\s*=\s*("|')(?("|')(?<=).|.)("|')\s*)*/?>" with string.Empty.

That regex should match any valid HTML tag.

EDIT: If you do not want a concatenated result, you can replace each tag with "<" instead of string.Empty and then split on '<', since in valid HTML '<' always starts a tag and should never be displayed. Alternatively, use the overload of Regex.Replace that takes a delegate and work with the match index and match length (that may turn out to be more efficient). Better still, use Regex.Match and walk from one matched tag to the next: Substring(PreviousMatchIndex + PreviousMatchLength, CurrentMatchIndex - (PreviousMatchIndex + PreviousMatchLength)) should give the inner text.
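
The tag-to-tag variant could look roughly like the sketch below. The tag pattern is a deliberately simplified stand-in (an assumption, not the regex quoted above), and `TextBetweenTags` is a hypothetical helper name:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class TagWalkDemo
{
    // Simplified tag matcher (assumption): "<" or "</", a name,
    // optional attributes, an optional trailing "/", then ">".
    static readonly Regex Tag =
        new Regex(@"</?[a-zA-Z][a-zA-Z0-9]*(?:\s[^<>]*)?/?>");

    // Hypothetical helper: returns the non-blank text found before each tag.
    static List<string> TextBetweenTags(string html)
    {
        var texts = new List<string>();
        int previousEnd = 0; // PreviousMatchIndex + PreviousMatchLength

        foreach (Match tag in Tag.Matches(html))
        {
            // Length = CurrentMatchIndex - (PreviousMatchIndex + PreviousMatchLength)
            string text = html.Substring(previousEnd, tag.Index - previousEnd);
            if (!string.IsNullOrWhiteSpace(text))
            {
                texts.Add(text);
            }
            previousEnd = tag.Index + tag.Length;
        }
        return texts;
    }

    static void Main()
    {
        string html =
            "<div id=\"mainDiv\"> << Generate Report>> </div>\n" +
            "<input id=\"name\" type=\"text\">Your Name->></input>";

        foreach (string text in TextBetweenTags(html))
        {
            Console.WriteLine("[{0}]", text);
        }
        // Prints:
        // [ << Generate Report>> ]
        // [Your Name->>]
    }
}
```

The stray `<<` and `>>` survive because the simplified pattern only treats `<` as the start of a tag when a letter (or `/` plus a letter) follows it. An attribute value containing `>` would still break it.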

Andrei15193