In this case, I am supposed to only use a single regex match.
See the following HTML code:

<html>
  <body>
    <p>This is some <strong>strong</strong> text</p>
  </body>
</html>

I want to write a regex that returns "This is some strong text", i.e. the text inside the <p> tag.

Overall, it should:

  • Match only text between two HTML tags.
  • Exclude HTML tags within the two tags, but keep the text inside those tags.

So far I know:

  • <p>(.*)<\/p> Will match the region from <p> to </p>
  • <[^>]*> Will match any HTML tag

The hard part for me is how to combine the two (maybe there is an even better way of doing it). How would you write such a regex?

Frederik Witte

2 Answers

How real software engineers solve this problem: use the right tool for the job, i.e. don't use regexes to parse HTML.

The most straightforward way is to use an HTML parsing library, since parsing even purely conforming XML with regex is extremely non-trivial, and handling all HTML edge cases is an inhumanly difficult task.
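
As a rough illustration of the parser route, here is a minimal C# sketch. It assumes the markup happens to be well-formed XML (as the sample in the question is), so it can lean on System.Xml.Linq from the base class library. Real-world HTML usually is not well-formed XML and would need a dedicated HTML parsing library instead. The class and method names are made up for this example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ParserSketch
{
    // Hypothetical helper: returns the text of the first <p> element, or null if there is none.
    // Assumption: the input is well-formed XML; XDocument.Parse throws otherwise.
    public static string FirstParagraphText(string markup)
    {
        XDocument doc = XDocument.Parse(markup);
        // XElement.Value concatenates all descendant text nodes,
        // so nested tags such as <strong> drop out but their text stays.
        return doc.Descendants("p").Select(p => p.Value).FirstOrDefault();
    }

    static void Main()
    {
        string html = "<html><body><p>This is some <strong>strong</strong> text</p></body></html>";
        Console.WriteLine(FirstParagraphText(html)); // This is some strong text
    }
}
```

No regex is involved here at all: the parser tracks nesting, which is the part that makes the regex approach fragile.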


If your requirements are "you must use a regex library to pull innerHTML from a <p> element", I'd much prefer to split it into two tasks:

1) using regex to pull out the container element with its innerHTML. (I'm showing an example that only works for getting the outermost element of a known tag. To extract an arbitrary nested item you'd have to use some trick like https://blogs.msdn.microsoft.com/bclteam/2005/03/15/net-regular-expressions-regex-and-balanced-matching-ryan-byington/ to match the balanced expression)

2) using a simple Regex.Replace to strip out all the tags, leaving only the text

open System.Text.RegularExpressions

let html = @"<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"

// Step 1: capture the inner HTML of each <p>; step 2: strip the tags from it.
for m in Regex.Matches(html, @"<p>(.*?)</p>") do
    printfn "(%O)" (Regex.Replace(m.Groups.[1].Value, "<.*?>", ""))

(This is some strong text)
(This is some reallystrong text)

If you are constrained to a single Regex.Matches call and are okay with ignoring nested <p> tags, you can do it with a non-greedy match of alternating text parts and tag parts, wrapped in a <p>...</p> pattern. (As luck would have it, conformant HTML can't nest <p> elements, but this solution wouldn't work for a containing element like <div>.) Note 1: this is F#, but it should be trivial to convert to C#. Note 2: this relies on .NET-specific regex features, namely reusing the same group name several times and keeping multiple captures per group.

let rx = @"
<p>
(?<p_text>
 (?:
   (?<text>[^<>]+)
   (?:<.*?>)+
 )*?
 (?<text>[^<>]+)?
)</p>
"
let regex = new Regex(rx, RegexOptions.IgnorePatternWhitespace)
for m in regex.Matches(@"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
 ") do
    printfn "p content: %O" m
    for capture in m.Groups.["text"].Captures do
        printfn "text: %O" capture

p content: <p>This is some <strong>strong</strong> text</p>
text: This is some 
text: strong
text:  text
p content: <p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
text: This is some 
text: really
text: strong
text:  text
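
For readers working in C#, the single-match pattern above might be ported roughly as follows. This is a sketch, not a hardened implementation: the class and method names are invented for the example, and joining the captures into one string is an extra step added here to get the exact text the question asks for. It inherits the limitations of the F# pattern, including the need for each tag run to be preceded by some text.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SingleMatchSketch
{
    // Same pattern as the F# version: the "text" group name is reused,
    // and .NET keeps every capture made under that name.
    static readonly Regex Paragraph = new Regex(@"
        <p>
        (?<p_text>
          (?:
            (?<text>[^<>]+)
            (?:<.*?>)+
          )*?
          (?<text>[^<>]+)?
        )</p>", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: one string per matched <p>, built by joining its text captures.
    public static string[] ExtractParagraphs(string html) =>
        Paragraph.Matches(html)
                 .Cast<Match>()
                 .Select(m => string.Concat(
                     m.Groups["text"].Captures.Cast<Capture>().Select(c => c.Value)))
                 .ToArray();

    static void Main()
    {
        string html = @"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>";
        foreach (string text in ExtractParagraphs(html))
            Console.WriteLine(text);
        // This is some strong text
        // This is some reallystrong text
    }
}
```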


Remember that neither of the above examples works well on malformed HTML or in cases where the same tag is nested in itself.
Jimmy

Following @Jimmy's answer, and going with the title of the post on how to "extract" the text, I thought I would include the C# code for the Regex.Replace.

This bit of code should work to extract the text:

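using System.Text.RegularExpressions;
using System.Windows.Forms; // MessageBox comes from WinForms

string HTML = "<html><body><p>This is some <strong>strong</strong> text</p></body></html>";

// Remove every tag, then trim the surrounding whitespace.
Regex Reg = new Regex("<[^>]*>");
string parsedText = Reg.Replace(HTML, "").Trim();

MessageBox.Show(parsedText);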

Obviously this does not match between the two tags exclusively (it would grab anything outside the paragraph tags as well), but I would suggest that the replace function is the best option in making only ONE match.

If you need to get only the content between the two tags, I think you would need to do that in two expressions, as @Jimmy suggested.
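
A two-expression version in C# might look roughly like this. It is only a sketch: the helper name is invented, and RegexOptions.Singleline is an addition not present in either answer, there so that a paragraph spanning several lines is still captured.

```csharp
using System;
using System.Text.RegularExpressions;

static class TwoStepSketch
{
    // Hypothetical helper: text of the first <p>...</p>, or null if there is no match.
    public static string ParagraphText(string html)
    {
        // Expression 1: capture everything between the first <p> and the next </p>.
        Match p = Regex.Match(html, @"<p>(.*?)</p>", RegexOptions.Singleline);
        if (!p.Success) return null;
        // Expression 2: strip the tags out of the captured inner HTML.
        return Regex.Replace(p.Groups[1].Value, "<[^>]*>", "");
    }

    static void Main()
    {
        string html = "<html><body>Outside<p>This is some <strong>strong</strong> text</p></body></html>";
        Console.WriteLine(ParagraphText(html)); // This is some strong text
    }
}
```

Unlike the plain Replace above, the word "Outside" is not part of the result, because only the captured paragraph content is passed to the second expression.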

I would be very curious to see if anyone could get it all in one expression, but I'm guessing this is what they are looking for at your school.

Hawkeye