EDIT: The characters come through correctly at first, but in the middle of the page there is this line: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">. After it, special characters are written as entities, e.g. é as &eacute;. Browsers render these fine, but when the page is downloaded via WebClient they show up as eacute; (without the &). END EDIT

I am extracting an excerpt from a web page using WebClient + RegEx.

But even with the encoding set correctly, é still comes out as eacute;, ç as ccedil;, í as iacute;, etc.

I followed the "DownloadString and Special Characters" example to set the charset (ISO-8859-1) correctly:

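using System.Text;
using System.Text.RegularExpressions;
System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadString("https://myurl"); // first request, made only to populate ResponseHeaders
var contentType = wc.ResponseHeaders["Content-Type"];
// extract the charset (here ISO-8859-1) from the Content-Type header
var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;
wc.Encoding = Encoding.GetEncoding(charset);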

It does set the charset to match the document's (ISO-8859-1), but when I do the follow-up DownloadString (I know I could set the encoding beforehand and make a single wc.DownloadString call, but I wanted to follow the accepted answer's example):

string result = wc.DownloadString("https://myurl");

The special characters still come out wrong.

NOTE: I am using a non-English Windows 10 (if it's relevant)

NOTE 2: The page's special characters appear correctly in any browser

My question is: why doesn't WebClient download the page correctly even with the correct charset set?

Tiago
  • I expect this is not a problem of your client. Did you try the same in a different client? For example you can use some browser extensions to tailor web requests as you wish. What I want to say is that the web server maybe simply sends the data in this form. – Al Kepp Nov 10 '17 at 10:22
  • WebClient.Encoding is used to convert unknown characters to unicode, which is .NET's internal form. But when your web server sends eacute;, it is just a standard text. None of its parts are unknown. It is regular e, regular a, regular c... What you probably want to get is something like HtmlDecode, which is something completely different than this Encoding stuff. – Al Kepp Nov 10 '17 at 10:26
  • The page opens correctly in any browser; it is only when I access it through WebClient that this happens. Perhaps I should add some headers to the WebClient (user-agent, accept, etc.)? **EDIT**: Setting headers didn't work either. – Tiago Nov 10 '17 at 10:35
  • I just took a closer look at the page source. There's this line `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">` in the middle of the document; after it, special characters become as I described in my question (although they render fine in the browser), but with **&** (i.e. `&ccedil;`). – Tiago Nov 10 '17 at 11:12
  • I didn't mean opening it in a browser. You must look at the bare data, because every browser processes &eacute; and similar entities, since they are a normal, valid representation of é etc. This is normal behavior: if the page source contains &eacute;, it is sent in this form to the browser, and the browser must decode it to é itself. Your program must decode it manually too. – Al Kepp Nov 10 '17 at 14:31
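
The manual decoding step described in the comments above could look roughly like this. It is only a sketch: it assumes the downloaded text still contains the entities intact (e.g. `&eacute;` with the ampersand), and the sample string and class name are made up for illustration. `WebUtility.HtmlDecode` is the standard `System.Net` method, separate from `WebClient.Encoding`.

```csharp
using System;
using System.Net;

class EntityDecodeSketch
{
    static void Main()
    {
        // Hypothetical excerpt standing in for text extracted from the downloaded page.
        string excerpt = "caf&eacute;, gar&ccedil;on, pa&iacute;s";

        // HtmlDecode replaces named and numeric character references with the
        // characters they stand for; it is unrelated to the byte-level charset.
        string decoded = WebUtility.HtmlDecode(excerpt);

        Console.WriteLine(decoded);
    }
}
```

If the ampersand is already missing by the time the text reaches this call, the string contains only plain letters and nothing will be decoded, so it may be worth checking whether the regex or a later processing step is what drops the `&`.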

1 Answer

using System.Text;

wc.Encoding = Encoding.UTF8;

  • The [Markdown help](https://stackoverflow.com/editing-help) will help you format your answer properly, but you really should add explanatory text to your code also. – Wai Ha Lee Apr 24 '19 at 21:19