This could be a duplicate question, but I have no idea what search terms to look up, so don't be hard on me if it has been asked before (and I'm pretty sure it was).

So I am getting a web page's source code using the WebClient class and saving the entire string in the source variable:

// Requires: using System.IO; using System.Net;
string source;
using (var client = new WebClient())
{
    client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
    // The using blocks dispose the stream and reader even if reading throws.
    using (var data = client.OpenRead(urlAddress))
    using (var reader = new StreamReader(data))
    {
        source = reader.ReadToEnd();
    }
}

Now I want to process certain text ranges from the source variable, especially user-posted messages. The problem is that in the web page's source, "&" appears as &amp;, "'" appears as a character code such as &#8217; (’), and quotes (") appear as the codes for –, “, ” and who knows what else.

Well, I could replace those codes with the actual symbols using the Replace string method, but I would like to know if there is a way to convert all those codes to the actual (expected) symbols. Is there a method that can do that, or maybe a library or some utility class on the Internet?

Mogsdad
IneedHelp
  • The term describing what you are seeing is "HTML encoding": http://en.wikipedia.org/wiki/Character_encodings_in_HTML – Jesse Webb Sep 11 '12 at 16:47
  • Thank you for the reference. Now I also learned that this thread could answer my question http://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c – IneedHelp Sep 11 '12 at 17:06

1 Answer

Try using HttpUtility.HtmlDecode or HttpServerUtility.HtmlDecode (both in the System.Web namespace).
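
As a minimal sketch of what such a decode call looks like: the example below uses System.Net.WebUtility.HtmlDecode, a sibling of the methods named above, chosen here only because it needs no reference to System.Web; HttpUtility.HtmlDecode is called the same way with a single string argument. The class name HtmlDecodeDemo, the helper DecodeEntities, and the sample string are made up for illustration.

```csharp
using System;
using System.Net;

public static class HtmlDecodeDemo
{
    // Hypothetical helper: turns HTML character references
    // (&amp;, &#8217;, ...) back into the characters they stand for.
    public static string DecodeEntities(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    public static void Main()
    {
        // Sample input is an assumption, not taken from a real page.
        string raw = "Tom &amp; Jerry&#8217;s &#8220;show&#8221;";

        // The decoded string is: Tom & Jerry’s “show”
        string decoded = DecodeEntities(raw);
        Console.WriteLine(decoded);
    }
}
```

In the question's code, this would amount to passing the downloaded source string (or just the message ranges extracted from it) through the decode call before further processing.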

Justin Niessner