EDIT: The characters come through correctly at first, but in the middle of the page there is this line: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">. After it, special characters such as é are written as HTML entities (&eacute;), which render fine in a browser but come through as eacute; (without the &) when downloaded via WebClient. END EDIT
I am extracting an excerpt from a web page using WebClient + Regex.
But even with the encoding set correctly, é still comes through as eacute;, ç as ccedil;, í as iacute;, etc.
I followed the "DownloadString and Special Characters" example to set the charset correctly (ISO-8859-1):
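using System.Text;                     // Encoding
using System.Text.RegularExpressions;  // Regex

System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadString("https://myurl"); // first request, only to populate wc.ResponseHeaders
var contentType = wc.ResponseHeaders["Content-Type"];
// Pull the charset value out of a header such as "text/html; charset=ISO-8859-1"
var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;
wc.Encoding = Encoding.GetEncoding(charset);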
It does set the charset to match the document's (ISO-8859-1), but when I do the follow-up DownloadString (I know I could set the encoding beforehand and do just one wc.DownloadString, but I wanted to follow the accepted answer's example):
string result = wc.DownloadString("https://myurl");
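For reference, the single-request variant mentioned above might look roughly like the sketch below. It assumes the charset is already known to be ISO-8859-1, so no first request is needed to read the Content-Type header. The class name and the CreateClient helper are made up for illustration, and "https://myurl" is the same placeholder URL as above, so the download itself is expected to fail unless a real address is substituted.

```csharp
using System;
using System.Net;
using System.Text;

public static class SingleRequestSketch
{
    // Hypothetical helper: builds a WebClient whose fallback encoding
    // is set before any request is made.
    public static WebClient CreateClient(string charsetName)
    {
        var wc = new WebClient();
        wc.Encoding = Encoding.GetEncoding(charsetName);
        return wc;
    }

    public static void Main()
    {
        // Assumption: the page is known in advance to be ISO-8859-1.
        using (var wc = CreateClient("ISO-8859-1"))
        {
            try
            {
                // Placeholder URL from the question; replace with the real page.
                string result = wc.DownloadString("https://myurl");
                Console.WriteLine(result.Length);
            }
            catch (WebException ex)
            {
                // The placeholder host does not resolve, so this branch is expected here.
                Console.WriteLine("Download failed: " + ex.Message);
            }
        }
    }
}
```

This only removes the extra round trip; it is not expected to change how the special characters come out.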
The special characters still come out wrong.
NOTE: I am using a non-English Windows 10 (if it's relevant)
NOTE 2: The page's special characters appear correctly in any browser
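A quick offline check may help narrow this down. The sketch below is not taken from the page in question: it decodes the same bytes as both ISO-8859-1 and UTF-8 to see when the charset makes a difference. An entity such as &eacute; consists only of ASCII characters, so it should decode identically under either charset, whereas a raw 0xE9 byte (é in ISO-8859-1) should not. The class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text;

public static class EntityCheck
{
    // Decodes the same raw bytes with two different charsets.
    public static (string Latin1, string Utf8) DecodeBoth(byte[] raw)
    {
        string latin1 = Encoding.GetEncoding("ISO-8859-1").GetString(raw);
        string utf8 = Encoding.UTF8.GetString(raw);
        return (latin1, utf8);
    }

    public static void Main()
    {
        // An HTML entity is plain ASCII, so the charset should not affect it.
        var (entityLatin1, entityUtf8) = DecodeBoth(Encoding.ASCII.GetBytes("&eacute;"));
        Console.WriteLine("Entity identical under both charsets: " + (entityLatin1 == entityUtf8));

        // A raw 0xE9 byte is where the choice of charset matters.
        var (rawLatin1, rawUtf8) = DecodeBoth(new byte[] { 0xE9 });
        Console.WriteLine("Raw byte identical under both charsets: " + (rawLatin1 == rawUtf8));

        // WebUtility.HtmlDecode resolves the entity only when the '&' is present.
        Console.WriteLine(WebUtility.HtmlDecode("&eacute;") == "\u00E9");
        Console.WriteLine(WebUtility.HtmlDecode("eacute;") == "eacute;");
    }
}
```

If the first comparison holds and the second does not, the charset setting would seem to be irrelevant to the entity-encoded part of the page, and the missing & would have to come from somewhere else in the download or the Regex step.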
My question is: why doesn't WebClient download the page correctly even with the correct charset set?