The following has been puzzling me for a while now.
First of all, I have been scraping sites for a couple of months, Hebrew sites among them, and have had no problem whatsoever receiving Hebrew characters from the HTTP server.
For some reason, which I am very curious to sort out, the following site is an exception: I can't get the characters properly decoded. I tried emulating the working requests I capture via Fiddler, but to no avail. My C# request headers look exactly the same, yet the characters are still unreadable.
What I do not understand is why I have always been able to retrieve Hebrew characters from other sites, but not from this one specifically. Which setting is causing this?
Try the following sample out.
HttpClient httpClient = new HttpClient();
httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
//httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html;q=0.9");
//httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.5");
//httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
var getTask = httpClient.GetStringAsync("http://winedepot.co.il/Default.asp?Page=Sale");
//doing it like this for the sake of the example
var contents = getTask.Result;
//add a breakpoint at the following line to check the contents of "contents"
Console.WriteLine();
As mentioned, the same code works for every other Israeli site I have tried, the Ynet news site for instance.
Update: While debugging with Fiddler, I noticed that the response from the Ynet site (one which works) includes the header
Content-Type: text/html; charset=UTF-8
while this header is absent from the response from winedepot.co.il.
I tried adding it to the response content in code, but it still made no difference.
var getTask = httpClient.GetAsync("http://www.winedepot.co.il");
var response = getTask.Result;
var contentObj = response.Content;
contentObj.Headers.Remove("Content-Type");
contentObj.Headers.Add("Content-Type", "text/html; charset=UTF-8");
var readTask = response.Content.ReadAsStringAsync();
var contents = readTask.Result;
Console.WriteLine();
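One possible explanation, which I have not confirmed, is that the page is served in a legacy Hebrew code page (windows-1255 is a common one) without declaring it, so HttpClient falls back to its default encoding when it builds the string. That would also be consistent with forcing charset=UTF-8 changing nothing. Below is a minimal sketch of reading the raw bytes and decoding them explicitly. The names CharsetSketch, DecodeBody and FetchAsync are hypothetical, and the windows-1255 fallback is a guess on my part, not something the server reports.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class CharsetSketch
{
    // Hypothetical helper: decode the raw body with the charset named in the
    // Content-Type header, or with a guessed fallback when no charset is given.
    public static string DecodeBody(byte[] body, string headerCharset, string fallbackCharset)
    {
        // On .NET Core / .NET 5+ legacy code pages must be registered before use.
        // On the classic .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string name = string.IsNullOrWhiteSpace(headerCharset)
            ? fallbackCharset
            : headerCharset.Trim('"'); // some servers quote the charset value

        return Encoding.GetEncoding(name).GetString(body);
    }

    // Fetch the page as bytes so that no implicit decoding happens, then decode explicitly.
    public static async Task<string> FetchAsync(HttpClient client, string url, string fallbackCharset)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            byte[] body = await response.Content.ReadAsByteArrayAsync();
            string headerCharset = response.Content.Headers.ContentType?.CharSet;
            return DecodeBody(body, headerCharset, fallbackCharset);
        }
    }
}
```

It would be called as CharsetSketch.FetchAsync(httpClient, "http://winedepot.co.il/Default.asp?Page=Sale", "windows-1255"). If the guess about the code page is wrong, the result will still be garbled, in which case the page's meta tag or the raw bytes in Fiddler should show the real encoding.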