-1

I have an HTML string from which I just want the text:

string html = "<span class=\"MyText\" id=\"1\">     SomeText blah blah</span>";

So I use the following expression:

public static string StripHTMLTags(string source)
{
    return Regex.Replace(source, "<.*?>", string.Empty);
}

But sometimes the HTML string contains several lines of HTML:

string html = "<span class=\"MyText\" id=\"1\">SomeText blah blah</span><br><span class=\"MyText\" id=\"2\">SomeText blah blah 1</span><br><span class=\"MyText\" id=\"2\">SomeText blah blah2</span>";

Now I want to extract the text between the <span> tags and store each piece in a list or array of lines.

NOTE: I am parsing custom HTML that will only ever contain two tags: the break tag and the span tag.

How can I do this using Regex?

Cœur
Harry Boy

3 Answers

1

Parsing HTML with Regex is cumbersome and error-prone. Have a look at the rather famous Stack Overflow post "RegEx match open tags except XHTML self-contained tags".

I suggest using a library for that. A widely used one is the Html Agility Pack (http://html-agility-pack.net), available via NuGet.

EDIT:

In order to get the inner text of HTML you can use something like this:

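using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent); // pageContent is the HTML string to parse
var pageText = pageDoc.DocumentNode.InnerText; // text of all nodes, tags removed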
wp78de
Ralf Bönning
  • I am parsing custom HTML that will only have two tags, <br> and <span> – Harry Boy Sep 06 '16 at 11:26
  • @HarryBoy - this may change over time. I have added a sample to the post. I think the code is easier to understand than a regular expression, which can get rather complicated. – Ralf Bönning Sep 06 '16 at 11:29
0

I don't know whether you can solve this with a different regex (I just don't know much about regular expressions), but what you could do is split the string every time "><" occurs and then extract the text from each of the resulting substrings.
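The splitting idea can be sketched roughly as follows. This is only an illustration, not tested against anything beyond the sample from the question: the helper name `ExtractBySplitting` is hypothetical, the sample string assumes well-formed `</span><br>` separators, and the snippet is written with C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: split on "><" and keep the text found between the
// first '>' and the next '<' of each piece.
static List<string> ExtractBySplitting(string source)
{
    var lines = new List<string>();
    foreach (string piece in source.Split(new[] { "><" }, StringSplitOptions.None))
    {
        int start = piece.IndexOf('>');
        if (start < 0) continue;                 // e.g. "br": no text inside
        int end = piece.IndexOf('<', start + 1);
        if (end < 0) continue;                   // no closing tag follows
        string text = piece.Substring(start + 1, end - start - 1).Trim();
        if (text.Length > 0) lines.Add(text);
    }
    return lines;
}

string sample = "<span class=\"MyText\" id=\"1\">SomeText blah blah</span><br>"
              + "<span class=\"MyText\" id=\"2\">SomeText blah blah 1</span>";
foreach (string line in ExtractBySplitting(sample))
    Console.WriteLine(line);
```

Note that this relies on there being no whitespace between adjacent tags; a "> <" or a newline between them would not be split.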

Also, http://regexr.com/ might help you try out different expressions.

Edit: Is there always a '<br>' after a '</span>'?

Djindjidj
0

Be careful when running this in a non-private application. As I said:

HTML is not regular enough to be parsed with regular expressions

However, simple HTML snippets like this one can be parsed with the following expression:

string txt =
    @"<span class=""MyText"" id=""1"">SomeText blah blah</span><br><span class=""MyText"" id=""2"">SomeText blah blah 1</span><br><span class=""MyText"" id=""2"">SomeText blah blah2</span>";

var matches = Regex.Matches(txt, "(?<=>)([^<]+)(?=<)");
foreach (Match match in matches)
    Console.WriteLine(match.Value);

It yields:

SomeText blah blah
SomeText blah blah 1
SomeText blah blah2
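Since the question asks for the lines in a list, the same pattern can be wrapped in a small helper that collects the matches. This is a minimal sketch: the method name `ExtractSpanText` is hypothetical, the trimming is an added assumption (the first sample in the question has leading spaces), and the snippet uses C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the pattern is the one used in this answer:
// one or more non-'<' characters that sit between a '>' and a '<'.
static List<string> ExtractSpanText(string source)
{
    var lines = new List<string>();
    foreach (Match match in Regex.Matches(source, "(?<=>)([^<]+)(?=<)"))
    {
        string text = match.Value.Trim();   // drop surrounding whitespace
        if (text.Length > 0) lines.Add(text);
    }
    return lines;
}

List<string> extracted = ExtractSpanText(
    "<span class=\"MyText\" id=\"1\">     SomeText blah blah</span>");
Console.WriteLine(string.Join(Environment.NewLine, extracted));
```

The lookbehind and lookahead keep the angle brackets out of the match, so no further cleanup of tag characters is needed.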
Paweł Dyl