Complete HTML Strip function

Question

I have an HTML string like this:

<p>First Sentence is this.&#160;Second sentence is this.</p>

I am able to remove the <p> tags from the above string using a regex function.

But how do I remove the `&#160;` encoded characters from the above string in WinForms?

I don't want `&#160;` to be present in the output.

Do you _understand_ the regex you copypasted? It's not going to touch the `&#160;` at all. You'll also want to decode the HTML-encoded entities. How to do that is very well documented. See for example [How can I decode HTML characters in C#?](http://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c). – CodeCaster, May 04 '15 at 10:41

Wiktor Stribiżew · Accepted Answer · 2015-05-04T12:08:23.077

5

You can use XElement.Parse to get the node value like this:

 var htmlString = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
 var result = System.Xml.Linq.XElement.Parse(htmlString).Value;

If some of the strings do not have a single root element, or have no tags at all, you can wrap them in fake tags like this:

 var htmlString = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
 var result = System.Xml.Linq.XElement.Parse("<root>" + htmlString + "</root>").Value;

You might want to add error handling for this, but this is clearly better than using a regex for this.
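The error handling mentioned above could look roughly like the sketch below. The class `HtmlText` and the method `TryGetText` are hypothetical names chosen for illustration, and returning `null` on failure is only one possible convention. It assumes the fragment is well-formed XML, which real-world HTML often is not.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class HtmlText
{
    // Hypothetical helper: returns the text content of a fragment,
    // or null when the fragment is not well-formed XML.
    public static string TryGetText(string fragment)
    {
        try
        {
            // Wrap in a fake root so bare text and sibling elements parse too.
            return XElement.Parse("<root>" + fragment + "</root>").Value;
        }
        catch (XmlException)
        {
            // Typical causes: unclosed tags such as <br>, or named
            // entities such as &nbsp; that XML does not declare.
            return null;
        }
    }
}
```

A caller can then fall back to another approach (for example a regex plus entity decoding) whenever `TryGetText` returns `null`. Note that `&#160;` comes back as the non-breaking space character U+00A0, not as an ordinary space.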

EDIT:

In case this is still not working, and you just want to handle the entities, you can use the System.Web.HttpUtility.HtmlDecode method to replace HTML entities with literal characters:

var final_result = System.Web.HttpUtility.HtmlDecode(result);
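For a WinForms project where adding a reference to `System.Web` is inconvenient, `System.Net.WebUtility.HtmlDecode` (available since .NET 4.0) does the same decoding. Below is a minimal sketch that strips tags with a regex, decodes entities, and turns the resulting non-breaking space into an ordinary one. The class `HtmlStripper`, the method `Strip`, and the pattern `<[^>]*>` are assumptions made for illustration, not part of the original answer, and a regex like this is not a real HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlStripper
{
    // Hypothetical helper: tags removed, entities decoded, whitespace trimmed.
    public static string Strip(string html)
    {
        // Remove anything that looks like a tag (simplistic by design).
        string noTags = Regex.Replace(html, "<[^>]*>", string.Empty);

        // Decode entities such as &#160; or &amp; into literal characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // &#160; decodes to U+00A0 (non-breaking space); map it to a
        // regular space, then trim leading and trailing whitespace.
        return decoded.Replace('\u00A0', ' ').Trim();
    }
}
```

For the string from the question, `HtmlStripper.Strip(htmlString)` yields `First Sentence is this. Second sentence is this.` with a plain space between the sentences.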

edited May 04 '15 at 12:08

answered May 04 '15 at 10:43

Wiktor Stribiżew

I am using winforms and System.Xml.Linq is not getting populated. – Earth May 04 '15 at 10:51
I think adding a reference to the `System.Xml.Linq.dll` (and perhaps, `System.Xml.dll`) should fix it. Also, try adding `using System.Linq;` and `using System.Xml.Linq;` to the list of using statements. – Wiktor Stribiżew May 04 '15 at 11:03
I tried now by adding the `System.Xml.Linq` dll. I am getting the output as `<p>First Sentence is this.Second sentence is this.</p>`. That is, I am getting the `html` tags now. – Earth May 04 '15 at 11:08
Could you show me the complete code you have? Using http://pastebin.com/, e.g. I guess you have `&lt;p&gt;` instead of `<p>`, right? Try `var result = System.Xml.Linq.XElement.Parse(htmlString.Replace("&lt;", "<").Replace("&gt;", ">").Replace("&amp;", "&")).Value;` – Wiktor Stribiżew May 04 '15 at 11:15
Now I am able to get the expected output using your latest code. But, for the `html strings` that contain class names like `
– Earth May 04 '15 at 11:36
If I use the `regex` `@"<(.|\n)*?>"` for those `html strings`, I do not get that error message, but at the end of the HTML string I get encoded characters like `\n \n`. So I am thinking of applying the `regex` first and then passing the regexed `html string` to `XElement`. Is that correct? Meanwhile I am trying it this way. – Earth May 04 '15 at 11:44
No, it does not work this way. Now I am getting the error message `Data at the root level is invalid. Line 1, position 1.` when calling `XElement.Parse`. – Earth May 04 '15 at 11:55
Do you mean there are strings that do not contain tags? Then, you should just add fake tags in the start and end. Or, you can try using `System.Web.HttpUtility.HtmlDecode`, e.g.: `var rs = System.Web.HttpUtility.HtmlDecode(result);`. You should add a reference to `System.Web`. – Wiktor Stribiżew May 04 '15 at 12:04
I tried `WebUtility.HtmlDecode` and I got `\n \n` characters at the end of the string. – Earth May 04 '15 at 12:12
Ok, you can use `.Trim()` to get rid of initial and final whitespace. – Wiktor Stribiżew May 04 '15 at 12:12
I am trying with `trim()`. Meanwhile I am posting here that `html string` in the next comment since too long. – Earth May 04 '15 at 12:16
`
This is the First Sentence.

` – Earth May 04 '15 at 12:16
Yes. `Trim` I used and I am getting the expected output. I also checked with two other complex `html strings`. Working fine as expected. Thanks for your valuable time and help. – Earth May 04 '15 at 12:23
Great. I am not against using a regex, esp. in cases like this. Perhaps, you can also try with `HtmlDocument`, too. I will test, and let you know. – Wiktor Stribiżew May 04 '15 at 12:29

score -3 · Answer 2 · edited May 04 '15 at 10:45

-3

Considering the fact that the input is a plain string:

string x = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
x = x.Replace("&#160;", " ");

This is way too simple, but will work.

edited May 04 '15 at 10:45

Uwe Keim

answered May 04 '15 at 10:45

Newton Sheikh
