Regex to extract pure text within specific HTML tag

Question

In this case, I am supposed to only use a single regex match.
See the following HTML code:

<html>
  <body>
    <p>This is some <strong>strong</strong> text</p>
  </body>
</html>

I want to write a regex that returns "This is some strong text", i.e. the text inside the <p> tag.

Overall, it should:

Match only text between two HTML tags.
Exclude HTML tags within the two tags, but keep the text inside those tags.

So far I know:

<p>(.*)<\/p> will match the region from <p> to </p>
<[^>]*> will match any HTML tag

The hard part for me is how to combine the two (maybe there is an even better way of doing it). How would you write such a regex?

Seriously look into `HtmlAgilityPack` (free and available via Nuget) - it'll make you a happier man! — code4life, Sep 06 '17 at 17:02
Maybe something like HtmlAgilityPack (https://www.nuget.org/packages/HtmlAgilityPack) would be more suited to your needs. — Titian Cernicova-Dragomir, Sep 06 '17 at 17:02
Why can you only use regex? What sort of monster gave you those requirements? — Broots Waymb, Sep 06 '17 at 17:05
What a terrible example to use for teaching regular expressions. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — , Sep 06 '17 at 17:09
Poor use of regex, but if the input is restricted to where it can be done, it might be okay as an exercise, as long as you aren't misled into thinking it's a good tool for HTML. — Tom Blodget, Sep 06 '17 at 17:09
As an example for your teacher, add a `
` , a ``. And let's see how his solution parses that. — Drag and Drop, Sep 07 '17 at 06:14

Jimmy · Accepted Answer · 2017-09-06T18:54:40.737

How real software engineers solve this problem: Use the right tool for the right job, i.e. don't use regexes to parse HTML

The most straightforward way is to use an HTML parsing library, since parsing even purely conforming XML with regex is extremely non-trivial, and handling all HTML edge cases is an inhumanly difficult task.
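For input that happens to be well-formed XML, the parser-based route can be sketched with the standard library alone. This is only a minimal sketch under stated assumptions, not the HtmlAgilityPack approach suggested in the comments: it uses System.Xml.Linq, is written with C# 9 top-level statements, and FirstParagraphText is a hypothetical helper name. Real-world HTML is often not valid XML, in which case XDocument.Parse throws and a dedicated HTML parser is the better fit.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: text of the first <p> element, or "" if there is none.
// Assumes the markup is well-formed XML; XDocument.Parse throws otherwise.
static string FirstParagraphText(string xhtml)
{
    XDocument doc = XDocument.Parse(xhtml);
    var paragraph = doc.Descendants("p").FirstOrDefault();
    // XElement.Value concatenates the text of all descendant nodes, dropping the tags.
    return paragraph == null ? "" : paragraph.Value;
}

string sample = "<html><body><p>This is some <strong>strong</strong> text</p></body></html>";
Console.WriteLine(FirstParagraphText(sample)); // This is some strong text
```

No regex is involved here, so nested tags inside the paragraph need no special handling.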

If your requirements are "you must use a regex library to pull innerHTML from a <p> element", I'd much prefer to split it into two tasks:

1) using regex to pull out the container element with its innerHTML. (I'm showing an example that only works for getting the outermost element of a known tag. To extract an arbitrary nested item you'd have to use some trick like https://blogs.msdn.microsoft.com/bclteam/2005/03/15/net-regular-expressions-regex-and-balanced-matching-ryan-byington/ to match the balanced expression)

2) using a simple Regex.Replace to strip out all tag content

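open System.Text.RegularExpressions

let html = @"<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"

// Step 1: lazily capture the inner HTML of each <p>...</p> pair.
for m in Regex.Matches(html, @"<p>(.*?)</p>") do
    // Step 2: delete every tag from the captured inner HTML.
    printfn "(%O)" (Regex.Replace(m.Groups.[1].Value, "<.*?>", ""))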

(This is some strong text)
(This is some reallystrong text)

If you are constrained to a single "Regex.Matches" call and are okay with ignoring the possibility of nested <p> tags, you should be able to do it with a non-greedy match of a text part and a tag part, repeated and wrapped inside a <p>...</p> pattern. (As luck would have it, conformant HTML does not allow nested <p> elements, but this solution wouldn't work for a containing element like <div>.)

Note 1: this is F#, but it should be trivial to convert to C#.
Note 2: this relies on .NET-flavored regex-isms like stackable group names and multiple captures per group.

open System.Text.RegularExpressions

// Both groups named "text" push onto the same capture stack (a .NET feature).
let rx = @"
<p>
(?<p_text>
 (?:
   (?<text>[^<>]+)
   (?:<.*?>)+
 )*?
 (?<text>[^<>]+)?
)</p>
"
let regex = new Regex(rx, RegexOptions.IgnorePatternWhitespace)
for m in regex.Matches(@"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
 ") do
    printfn "p content: %O" m
    // Each capture is one run of plain text between tags, in document order.
    for capture in m.Groups.["text"].Captures do
        printfn "text: %O" capture

p content: <p>This is some <strong>strong</strong> text</p>
text: This is some 
text: strong
text:  text
p content: <p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
text: This is some 
text: really
text: strong
text:  text
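Per Note 1, a C# conversion of the single-regex version might look like the following. Treat it as a sketch rather than a definitive port: the pattern is copied from the F# code, the helper name ParagraphTexts and the comments inside the pattern are my own additions, and it assumes C# 9 top-level statements. It joins the captures of the "text" group so that each <p> yields one string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as the F# version; both "text" groups feed one capture list.
var paragraphPattern = new Regex(@"
    <p>
    (?<p_text>
      (?:
        (?<text>[^<>]+)   # a run of plain text
        (?:<.*?>)+        # followed by one or more tags
      )*?
      (?<text>[^<>]+)?    # optional trailing text
    )</p>",
    RegexOptions.IgnorePatternWhitespace);

// Hypothetical helper: one string per <p> match, built from its "text" captures.
string[] ParagraphTexts(string input) =>
    paragraphPattern.Matches(input)
        .Cast<Match>()
        .Select(m => string.Concat(
            m.Groups["text"].Captures.Cast<Capture>().Select(c => c.Value)))
        .ToArray();

string sample =
    "<p>This is some <strong>strong</strong> text</p>\n" +
    "<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>";

foreach (string text in ParagraphTexts(sample))
    Console.WriteLine(text);
// This is some strong text
// This is some reallystrong text
```

Only one Regex.Matches call is made; the joining happens on the captures, not with a second regex.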

Remember that neither of the above examples works well on malformed HTML or in cases where the same tag is nested in itself.

Hawkeye · Answer 2 · 2017-09-07T16:28:25.417

Following @Jimmy's answer, and going with the title of the post on how to "extract" the text, I thought I would include the C# code for the Regex.Replace.

This bit of code should work to extract the text:

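using System.Text.RegularExpressions;
using System.Windows.Forms; // for MessageBox

string html = "<html><body><p>This is some <strong>strong</strong> text</p></body></html>";

// Remove every tag; only the text between the tags remains.
Regex tagPattern = new Regex("<[^>]*>");
string parsedText = tagPattern.Replace(html, "").Trim();

MessageBox.Show(parsedText); // shows "This is some strong text"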

Obviously this does not match between the two tags exclusively (it would grab anything outside the paragraph tags as well), but I would suggest that the replace function is the best option in making only ONE match.

If you need to get only the content between the two tags, I think you would need to do that in two expressions, as @Jimmy suggested.
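The difference between stripping tags from the whole document and the two-expression approach can be sketched as follows. This is an illustration under assumptions, not code from either answer: the <h1> in the sample input, the helper names StripTags and TextOfParagraphs, and the RegexOptions.Singleline flag (so a paragraph may span several lines) are all my additions, and the snippet uses C# 9 top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Strips every tag from the input, as in the snippet above.
static string StripTags(string input) => Regex.Replace(input, "<[^>]*>", "");

// Hypothetical helper: isolate each <p>...</p> first, then strip tags from its inner HTML only.
static string[] TextOfParagraphs(string input) =>
    Regex.Matches(input, "<p>(.*?)</p>", RegexOptions.Singleline)
         .Cast<Match>()
         .Select(m => StripTags(m.Groups[1].Value))
         .ToArray();

string sample =
    "<html><body><h1>Title</h1><p>This is some <strong>strong</strong> text</p></body></html>";

Console.WriteLine(StripTags(sample).Trim());    // TitleThis is some strong text
Console.WriteLine(TextOfParagraphs(sample)[0]); // This is some strong text
```

The first call also picks up the heading text, which is the caveat described above; the second keeps only what sits inside the paragraph.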

I would be very curious to see if anyone could get it all in one expression, but I'm guessing this is what they are looking for at your school.
