1

I am trying to compare two strings, but I just realized that one of them already contains some HTML formatting.

How can I get these two strings to match when doing string1 == string2? (Note: I don't know up front what the HTML formatting is going to be.)

string1 = "This is a test";
string2 = "<font color=\"black\" size=\"1\">This is a test</font>";
Mikael Svenson
leora

3 Answers

7

Load the HTML into the Html Agility Pack and extract only the text.

string html = "<html><body><div>test</div></body></html>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
string text = document.DocumentNode.InnerText;

This will not remove the content of <script> nodes, but you can easily remove the script nodes first.
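As a rough sketch of that extra step, assuming the Html Agility Pack NuGet package is referenced: the helper name ExtractText is hypothetical, and dropping style nodes and decoding entities with HtmlEntity.DeEntitize are additions of this sketch rather than part of the answer above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper: returns the visible text of an HTML fragment.
static string ExtractText(string html)
{
    var document = new HtmlDocument();
    document.LoadHtml(html);

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var unwanted = document.DocumentNode.SelectNodes("//script|//style");
    if (unwanted != null)
    {
        // Copy to a list so removal cannot disturb the enumeration.
        foreach (HtmlNode node in unwanted.ToList())
        {
            node.Remove();
        }
    }

    // InnerText concatenates the remaining text nodes; DeEntitize turns
    // entities such as &amp; back into plain characters.
    return HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
}

string string1 = "This is a test";
string string2 = "<font color=\"black\" size=\"1\">This is a test</font>";
Console.WriteLine(string1 == ExtractText(string2));
```

Because LoadHtml takes the markup itself rather than a URL or file path, the formatted string from the question can be passed in directly.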

SpruceMoose
Mikael Svenson
  • Obligatory link - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Oded Aug 06 '10 at 11:09
  • @Mikael Svenson - how do I extract only the text using the HTML Agility Pack? – leora Aug 06 '10 at 11:12
  • @ooo: I added a sample on how to do it. – Mikael Svenson Aug 06 '10 at 11:14
  • HtmlAgilityPack is pretty awesome for parsing HTML. – alimbada Aug 06 '10 at 11:17
  • @Mikael Svenson - can I simply shove This is a test into the code above? It doesn't seem to work; hw.Load() seems to be looking for a URL as the parameter. – leora Aug 06 '10 at 13:25
  • @ooo use HtmlNode.CreateNode(myHtmlString) to create a new HtmlNode and then get the InnerText of the instance – alimbada Aug 06 '10 at 14:03
  • @Mikael Svenson - thanks, this worked, but I found one issue, which I posted as a separate question here: http://stackoverflow.com/questions/3425554/does-the-html-agility-pack-work-on-internal-text – leora Aug 06 '10 at 16:06
  • Are you sure that `InnerText` removes *all* HTML tags, rather than just the outermost pair? – Timwi Aug 06 '10 at 20:14
  • Yes. Just like it will in the DOM object on a webpage (except innerText is not cross platform javascript). InnerText gives back all text inside the container you start from.
    lala
    lala
    gives the same result as
    lala
    lala
    – Mikael Svenson Aug 06 '10 at 20:41
0
string newText = System.Text.RegularExpressions.Regex.Replace(OldHtmlTextHere, "<[^>]*>", string.Empty);
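A minimal sketch of how that one-liner could be applied to the comparison in the question. The helper name StripHtml is hypothetical, and the extra WebUtility.HtmlDecode call is an addition of this sketch, included so that entities such as &amp; also compare equal. A regex like this is only a heuristic: it can misfire on markup where a ">" appears inside an attribute value or a comment.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: removes anything that looks like a tag,
// then decodes character entities in what is left.
static string StripHtml(string html)
{
    string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);
    return WebUtility.HtmlDecode(withoutTags);
}

string string1 = "This is a test";
string string2 = "<font color=\"black\" size=\"1\">This is a test</font>";

Console.WriteLine(string1 == StripHtml(string2)); // prints "True"
```

Only the string known to contain markup needs to go through the helper, although running both sides through it is harmless when neither contains a literal "<" or "&".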
Moslem Hadi
  • Hehe, cool. This may still fail, though, if the inner text has character entities like &amp;. Then again, I am not sure whether the accepted answer's solution takes care of that. – Martin Maat Jun 20 '16 at 05:49
  • @MartinMaat I don't know; I use this function in all of my projects. It has never let me down! – Moslem Hadi Jun 20 '16 at 12:57
-5

Check out System.Web.HttpUtility.HtmlDecode.

PPShein