Retrieving special InnerText from HTML using Regex in C#

Question

I have an HTML file and I am trying to retrieve the valid inner text from each tag. I am using Regex for this with the following pattern:

(?<=>).*?(?=<)

It works fine for simple inner text. However, I recently encountered the following HTML fragments, where the text itself contains angle brackets:

<div id="mainDiv"> << Generate Report>> </div>
<input id="name" type="text">Your Name->></input>

I am not sure how to retrieve these inner texts with regular expressions. Can someone please help?

Thanks

Please read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454), and opt for HTML Agility Pack – Andrei, Feb 11 '14 at 17:49

score 2 · Accepted Answer · answered Feb 11 '14 at 17:54

I'd use a parser, but this is possible with RegEx using something like:

<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>

Then you can grab the inner text with capture group 2.
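As a minimal sketch of reading capture group 2 in C#, assuming the pattern above is applied with `Regex.Matches` to the two fragments from the question (the class and method names here are hypothetical, chosen only for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InnerTextByCaptureGroup
{
    // Pattern from the answer: group 1 is the tag name, group 2 is the inner text.
    private static readonly Regex ElementPattern =
        new Regex(@"<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>");

    // Hypothetical helper: returns the inner text of every matched element.
    public static List<string> Extract(string html)
    {
        var texts = new List<string>();
        foreach (Match match in ElementPattern.Matches(html))
        {
            // Groups[0] is the whole element; Groups[2] is only the text between the tags.
            texts.Add(match.Groups[2].Value);
        }
        return texts;
    }

    public static void Main()
    {
        string html =
            "<div id=\"mainDiv\"> << Generate Report>> </div>\n" +
            "<input id=\"name\" type=\"text\">Your Name->></input>";

        // Brackets make the leading and trailing spaces visible.
        foreach (string text in Extract(html))
        {
            Console.WriteLine("[" + text + "]");
        }
    }
}
```

Because `.` does not match a newline by default, this only finds elements that open and close on the same line, and it inherits the usual limits of regex-based HTML handling (nested elements with the same tag name, unclosed tags).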

answered Feb 11 '14 at 17:54

tenub


Thanks for your response. This grabs the entire div block. Not just the inner text. Also, there is only one capture. :( – K S Feb 11 '14 at 18:11
`([a-zA-Z0-9]+)` is the first capture and `\/\1` is the backreference to it to match the closing tag. `(.+?)` is the second capture group. I said to retrieve the second capture group, not the entire match (capture group 0). – tenub Feb 11 '14 at 18:13

Anirudha · Answer 2 · 2014-02-13T05:05:52.257

1

That's exactly why you don't use regex for parsing HTML. You can, however, get around this particular problem by using a backreference in the regex:

(?<=<(\w+)(?:\s[^<>]*)?>).*?(?=</\1>)

That won't always work, though, because:

tags won't always have a closing tag
tag attributes can contain < and >
there can be arbitrary spaces around a tag's name

Use an HTML parser such as HTML Agility Pack instead.

Your code would be as simple as this:

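// Requires the HtmlAgilityPack package, plus: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// InnerText of all divs (SelectNodes returns null when nothing matches)
List<string> divs = doc.DocumentNode
                       .SelectNodes("//div")
                       .Select(x => x.InnerText).ToList();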

edited Feb 13 '14 at 05:05

answered Feb 11 '14 at 17:53


Didn't know c# allowed quantifiers in lookbehinds. I'm jealous. – tenub Feb 11 '14 at 17:55
@tenub only .net supports it... :-) – Anirudha Feb 11 '14 at 17:56
For some reason, HtmlDocument constructor is not defined. I do have reference to System.Windows.Forms.HtmlDocument in my solution. – K S Feb 11 '14 at 18:10
@HaritSingh You are using the wrong class. You need to use HtmlAgilityPack's HtmlDocument class, not System.Windows.Forms.HtmlDocument. – Anirudha Feb 11 '14 at 18:19

Andrei15193 · Answer 3 · 2014-02-11T18:08:27.573

You can always strip out the HTML tags themselves: an individual tag can be described by a regular grammar, even though HTML as a whole cannot. Replace "<[a-zA-Z][a-zA-Z0-9]*\s*([a-zA-Z]+\s*=\s*("|')(?("|')(?<=).|.)("|')\s*)*/?>" with string.Empty.

That regex should match any valid HTML tag.

EDIT: If you do not want a single concatenated result, you can replace each tag with "<" instead of string.Empty and then split on '<', since in valid HTML a literal '<' always starts a tag and should never be displayed. Alternatively, you can use the overload of Regex.Replace that takes a delegate and work with the match index and match length (that may turn out to be more efficient). Better still, use Regex.Match and walk from matched tag to matched tag: Substring(PreviousMatchIndex + PreviousMatchLength, CurrentMatchIndex - (PreviousMatchIndex + PreviousMatchLength)) should give the inner text.
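The tag-to-tag walk can be sketched roughly as follows. This is only an illustration under stated assumptions: `TagPattern` is a simplified, hypothetical tag regex rather than the one given in this answer, and the class and method names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InnerTextBetweenTags
{
    // Assumption: a simplified tag pattern, NOT the regex from the answer.
    // Optional '/', a tag name, optional attributes without angle brackets, optional '/'.
    private static readonly Regex TagPattern =
        new Regex(@"</?[a-zA-Z][a-zA-Z0-9]*(?:\s[^<>]*)?/?>");

    // Hypothetical helper: collects the text found between consecutive tag matches.
    public static List<string> Extract(string html)
    {
        var texts = new List<string>();
        int previousEnd = -1; // PreviousMatchIndex + PreviousMatchLength; -1 before the first tag

        for (Match tag = TagPattern.Match(html); tag.Success; tag = tag.NextMatch())
        {
            if (previousEnd >= 0 && tag.Index > previousEnd)
            {
                // The length is CurrentMatchIndex minus the end of the previous tag.
                string text = html.Substring(previousEnd, tag.Index - previousEnd);

                // Skip gaps that only hold whitespace, such as line breaks between elements.
                if (!string.IsNullOrWhiteSpace(text))
                {
                    texts.Add(text);
                }
            }
            previousEnd = tag.Index + tag.Length;
        }
        return texts;
    }

    public static void Main()
    {
        string html =
            "<div id=\"mainDiv\"> << Generate Report>> </div>\n" +
            "<input id=\"name\" type=\"text\">Your Name->></input>";

        // Brackets make the leading and trailing spaces visible.
        foreach (string text in Extract(html))
        {
            Console.WriteLine("[" + text + "]");
        }
    }
}
```

Because the simplified pattern requires a letter (or `/` and a letter) straight after `<`, the stray `<<` in the question's text is not mistaken for a tag. Attribute values that contain `<` or `>` would still break it.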
