-3

I have an HTML string like this:

<p>First Sentence is this.&#160;Second sentence is this.</p>

I am able to remove the <p> tags from the above string using a regex function.

But how do I remove encoded characters such as &#160; from the above string in WinForms?

I don't want &#160; to be present in the output.

Wiktor Stribiżew
Earth
  • Do you _understand_ the regex you copypasted? It's not going to touch the `&#160;` at all. You'll also want to unencode HTML-encoded entities. How to do that is very well documented. See for example [How can I decode HTML characters in C#?](http://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c). – CodeCaster May 04 '15 at 10:41

2 Answers

5

You can use XElement.Parse to get the node value like this:

 var htmlString = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
 var result = System.Xml.Linq.XElement.Parse(htmlString).Value;

If not all the strings contain valid XML structure, or may have no tags at all, you can add fake tags like this:

 var htmlString = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
 var result = System.Xml.Linq.XElement.Parse("<root>" + htmlString + "</root>").Value;

You may want to add error handling, because XElement.Parse throws an exception on input that is not well-formed XML, but this is more robust than stripping the tags with a regex.
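
As a minimal sketch of that error handling, the parse can be wrapped in a try/catch. The `HtmlText` class and `TryGetText` method are hypothetical names used only for illustration. Note that `XElement.Parse` rejects HTML-only constructs such as the named entity `&nbsp;` or an unclosed `<br>`, so this approach only suits input that is well-formed XML.

```csharp
using System.Xml;
using System.Xml.Linq;

static class HtmlText
{
    // Hypothetical helper: extracts the text content of an HTML fragment.
    // Returns false when the fragment is not well-formed XML.
    public static bool TryGetText(string html, out string text)
    {
        try
        {
            // The <root> wrapper lets tag-less or multi-element fragments parse.
            text = XElement.Parse("<root>" + html + "</root>").Value;
            return true;
        }
        catch (XmlException)
        {
            // Raised for malformed markup or undeclared entities such as &nbsp;.
            text = null;
            return false;
        }
    }
}
```

On success, `&#160;` comes back as the non-breaking space character U+00A0, not as an ordinary space.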

EDIT:

If this still does not work and you only want to handle the entities, you can use the System.Web.HttpUtility.HtmlDecode method to replace HTML entities with their literal characters:

var final_result = System.Web.HttpUtility.HtmlDecode(result);
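
For a WinForms project that does not reference `System.Web`, `System.Net.WebUtility.HtmlDecode` offers the same decoding. The sketch below combines it with the regex tag stripping and the `Trim()` call discussed in the comments. The `HtmlCleaner` class, the `ToPlainText` method, and the simple `<[^>]+>` pattern are assumptions for illustration, not a general HTML parser.

```csharp
using System.Net;
using System.Text.RegularExpressions;

static class HtmlCleaner
{
    // Hypothetical helper: strips tags, decodes entities, normalizes spaces.
    public static string ToPlainText(string html)
    {
        // Remove tags first, so encoded text such as &lt;b&gt; is not
        // mistaken for markup after decoding. Naive pattern, assumed adequate
        // for simple fragments only.
        string withoutTags = Regex.Replace(html, "<[^>]+>", string.Empty);

        // Turn entities such as &#160; or &amp; into their literal characters.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // &#160; decodes to U+00A0 (non-breaking space); swap it for a
        // regular space, then drop leading and trailing whitespace.
        return decoded.Replace('\u00A0', ' ').Trim();
    }
}
```

If the non-breaking space should be dropped instead of converted, replacing it with an empty string in the last step would do that.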
Wiktor Stribiżew
  • I am using winforms and System.Xml.Linq is not getting populated. – Earth May 04 '15 at 10:51
  • I think adding a reference to the `System.Xml.Linq.dll` (and perhaps, `System.Xml.dll`) should fix it. Also, try adding `using System.Linq;` and `using System.Xml.Linq;` to the list of using statements. – Wiktor Stribiżew May 04 '15 at 11:03
  • I tried now by adding the `System.Xml.Linq` dll. I am getting the output as `<p>First Sentence is this.Second sentence is this.</p>`. That is, I am getting the `html` tags now. – Earth May 04 '15 at 11:08
  • Could you show me the complete code you have? Using http://pastebin.com/, e.g. I guess you have `&lt;p&gt;` instead of `<p>`, right? Try `var result = System.Xml.Linq.XElement.Parse(htmlString.Replace("&lt;", "<").Replace("&gt;", ">").Replace("&amp;", "&")).Value;` – Wiktor Stribiżew May 04 '15 at 11:15
  • Now I am able to get the expected output using your latest code. But, for the `html strings` that contain class names like `

    – Earth May 04 '15 at 11:36
  • If I use the `regex` `@"<(.|\n)*?>";` for those `html strings`, I do not get that error message, but at the end of the html string I get encoded characters like `\n \n`. So I am thinking of applying the `regex` first and then passing the regexed `html string` to `XElement`. Is that correct? Meanwhile I am trying it this way. – Earth May 04 '15 at 11:44
  • No, that does not work either. Now I am getting the error message `Data at the root level is invalid. Line 1, position 1.` when doing `XElement.Parse`. – Earth May 04 '15 at 11:55
  • Do you mean there are strings that do not contain tags? Then, you should just add fake tags in the start and end. Or, you can try using `System.Web.HttpUtility.HtmlDecode`, e.g.: `var rs = System.Web.HttpUtility.HtmlDecode(result);`. You should add a reference to `System.Web`. – Wiktor Stribiżew May 04 '15 at 12:04
  • I tried `WebUtility.HtmlDecode` and I got `\n \n` characters at the end of the string. – Earth May 04 '15 at 12:12
  • Ok, you can use `.Trim()` to get rid of initial and final whitespace. – Wiktor Stribiżew May 04 '15 at 12:12
  • I am trying with `trim()`. Meanwhile I am posting here that `html string` in the next comment since too long. – Earth May 04 '15 at 12:16
  • `

    This is the First Sentence.

     


    `
    – Earth May 04 '15 at 12:16
  • Yes. `Trim` I used and I am getting the expected output. I also checked with two other complex `html strings`. Working fine as expected. Thanks for your valuable time and help. – Earth May 04 '15 at 12:23
  • Great. I am not against using a regex, esp. in cases like this. Perhaps, you can also try with `HtmlDocument`, too. I will test, and let you know. – Wiktor Stribiżew May 04 '15 at 12:29
-3

Considering that the input is a plain string, you can replace the entity directly:

string x = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
x = x.Replace("&#160;", " ");

This is very simple, but it works for this input.
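
The literal replacement only matches the exact text `&#160;`. As a hedged sketch, the same idea can be extended to the other common spellings of a non-breaking space. The `NbspReplacer` class and `ReplaceNbsp` method are hypothetical names, and the list of spellings is an assumption about what the input may contain.

```csharp
static class NbspReplacer
{
    // Hypothetical helper: replaces common encodings of the non-breaking
    // space, and the raw U+00A0 character itself, with a regular space.
    public static string ReplaceNbsp(string input)
    {
        return input
            .Replace("&#160;", " ")   // decimal character reference
            .Replace("&nbsp;", " ")   // named entity
            .Replace("&#xA0;", " ")   // hexadecimal reference, upper case
            .Replace("&#xa0;", " ")   // hexadecimal reference, lower case
            .Replace('\u00A0', ' ');  // already-decoded character
    }
}
```

Other entities such as `&amp;` are left untouched by this approach, so a general decoder is still the safer choice when the input can contain them.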

Uwe Keim
Newton Sheikh