
The following has been amusing me for a while now.

First of all, I have been scraping sites for a couple of months, Hebrew sites among them, and have had no problem whatsoever receiving Hebrew characters from the HTTP server.

For some reason I am very curious to sort out, the following site is an exception. I can't get the characters properly decoded. I tried emulating the working requests I see in Fiddler, but to no avail. My C# request headers look exactly the same, but the characters are still unreadable.

What I do not understand is why I have always been able to retrieve Hebrew characters from other sites, while from this one specifically I cannot. What setting is causing this?

Try the following sample out.

    HttpClient httpClient = new HttpClient();
    httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
    //httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html;q=0.9");
    //httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.5");
    //httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");

    var getTask = httpClient.GetStringAsync("http://winedepot.co.il/Default.asp?Page=Sale");

    //doing it like this for the sake of the example
    var contents = getTask.Result;

    //add a breakpoint at the following line to check the contents of "contents"
    Console.WriteLine();

As mentioned, such code works for any other Israeli site I try, for instance the Ynet news site.


Update: I figured out while "debugging" with Fiddler that the response for the Ynet site (one which works) includes the header

    Content-Type: text/html; charset=UTF-8

while this header is absent from the response from winedepot.co.il.

I tried adding it, but it still made no difference.

    var getTask = httpClient.GetAsync("http://www.winedepot.co.il");

    var response = getTask.Result;

    var contentObj = response.Content;
    contentObj.Headers.Remove("Content-Type");
    contentObj.Headers.Add("Content-Type", "text/html; charset=UTF-8");

    var readTask = response.Content.ReadAsStringAsync();
    var contents = readTask.Result;
    Console.WriteLine();
Veverke
  • `winedepot.co.il` uses `windows-1255 (hebrew)` and not UTF-8. – haim770 Mar 31 '16 at 08:20
  • @haim770: thanks Haim. How do you conclude that ? Where do you see this ? By the way, I would think UTF-8 would cover most character sets (but I have almost no knowledge in charsets, anyway.... ) – Veverke Mar 31 '16 at 08:46
  • They included a charset-declaring `<meta>` tag in their `<head>`. Your browser can also tell you the current encoding of the website. – haim770 Mar 31 '16 at 08:47
  • Gotcha... the meta tag... forgot that it can also set the charset. Thanks. Fixed it. Please post this as an answer, so I can give you the points. – Veverke Mar 31 '16 at 08:55
  • So UTF-8 does not "cover" windows-1255 mappings ? In other words, the char mappings are different ? – Veverke Mar 31 '16 at 08:55
  • I checked Ynet's site and indeed there is a meta tag there that sets the charset to UTF-8. This however raises another question: why is it that for sites whose meta tag sets the charset to UTF-8 the HTTP response seems to carry the corresponding charset setting, while in this example (winedepot) the response content's header does not include the charset setting? After all, this is a setting that comes in the meta tag for any site that has a special charset, for the very purpose of telling the response in which charset it should be created? – Veverke Mar 31 '16 at 09:04
  • @haim770: In case you have an answer for [this](http://stackoverflow.com/q/36329642/1219280) one as well... – Veverke Mar 31 '16 at 10:09

1 Answer


The problem you're encountering is that the webserver is lying about its content-type, or rather, not being specific enough.

The first site responds with this header:

    Content-Type: text/html; charset=UTF-8

The second one with this header:

    Content-Type: text/html

This means that in the second case, your client will have to make assumptions about what encoding the text is actually in. To learn more about text encodings, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
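As a small illustration of the difference between the two headers, here is a minimal sketch that parses a Content-Type value and reports its charset parameter. `CharsetHeaderDemo` and `GetCharset` are names made up for this example; `MediaTypeHeaderValue` is the type that `HttpClient` exposes through `response.Content.Headers.ContentType`.

```csharp
using System;
using System.Net.Http.Headers;

public static class CharsetHeaderDemo
{
    // Hypothetical helper: returns the charset parameter of a Content-Type
    // header value, or null when the header does not specify one.
    public static string GetCharset(string contentTypeHeader)
    {
        MediaTypeHeaderValue parsed;
        if (!MediaTypeHeaderValue.TryParse(contentTypeHeader, out parsed))
        {
            return null; // not a valid media type at all
        }
        return parsed.CharSet; // null when no charset parameter is present
    }
}
```

`GetCharset("text/html; charset=UTF-8")` returns `"UTF-8"`, while `GetCharset("text/html")` returns `null`, which is the case where the client is left guessing.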

And the built-in HTTP clients for .NET don't really do a great job at this, which is understandable, because it is a Hard Problem. Read the linked article for the trouble a web browser will have to go through in order to guess the encoding, and then try to understand why you don't want this logic in a programmable web client.

Now the sites do provide you with a <meta http-equiv="Content-Type" content="actual encoding here" /> tag, which is a nasty workaround for not having to properly configure a web server. When a browser encounters such a tag, it will have to restart parsing the document with the specified content-type, and then hope it is correct.

The steps roughly are, assuming an HTML payload:

  1. Perform web request, keep the response document in a binary buffer.
  2. Inspect the content-type header, if present. If it is absent or doesn't specify a charset, make an assumption about the encoding.
  3. Read the response by decoding the buffer, and parsing the resulting HTML.
  4. When encountering a <meta http-equiv="Content-Type" /> element, discard all decoded text, and start again by interpreting the binary buffer as text encoded in the specified encoding.
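The steps above can be sketched roughly as follows. This is not how any browser or .NET class actually does it; `HtmlCharsetSniffer`, `Decode`, the regular expression, the 4096-byte sniffing window and the final UTF-8 fallback are all assumptions made for the example. It also assumes a modern .NET runtime (.NET Core 3.0 or later), where legacy code pages such as windows-1255 must be registered first; on .NET Framework that registration line can be dropped.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlCharsetSniffer
{
    // Assumption: a simple pattern that finds charset=... inside a <meta> tag.
    // Real HTML parsing is considerably more involved than this.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    static HtmlCharsetSniffer()
    {
        // Needed on .NET Core / .NET 5+ so that windows-1255 and friends resolve.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // body: the raw response bytes (step 1).
    // headerCharset: the charset from the Content-Type header, or null.
    public static string Decode(byte[] body, string headerCharset)
    {
        // Step 2: trust the HTTP header when it names a usable charset.
        Encoding fromHeader = TryGetEncoding(headerCharset);
        if (fromHeader != null)
        {
            return fromHeader.GetString(body);
        }

        // Step 3: provisional decode of the first bytes. ISO-8859-1 maps every
        // byte to a character, so an ASCII-only <meta> tag survives intact.
        int window = Math.Min(body.Length, 4096);
        string provisional = Encoding.GetEncoding("ISO-8859-1").GetString(body, 0, window);

        // Step 4: if a <meta> tag names a charset, decode the whole buffer again.
        Match match = MetaCharset.Match(provisional);
        Encoding fromMeta = match.Success ? TryGetEncoding(match.Groups[1].Value) : null;

        // Assumption: fall back to UTF-8 when nothing usable was found.
        return (fromMeta ?? Encoding.UTF8).GetString(body);
    }

    private static Encoding TryGetEncoding(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return null;
        try { return Encoding.GetEncoding(name.Trim().Trim('"')); }
        catch (ArgumentException) { return null; } // unknown charset name
    }
}
```

With `HttpClient`, the inputs could come from `await response.Content.ReadAsByteArrayAsync()` and `response.Content.Headers.ContentType?.CharSet`. Note that the buffer is decoded again from the bytes already received, so no second request is needed.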

The C# HTTP clients stop at step 2, and rightfully so. They are HTTP clients, not HTML-displaying browsers. They don't care that your payload is HTML, JSON, XML, or any other textual format.

When no charset is given in the content-type response header, the .NET HTTP clients fall back to a default encoding: HttpWebResponse.CharacterSet reports ISO-8859-1, and HttpContent.ReadAsStringAsync decodes as UTF-8 unless it finds a byte order mark. Neither default can correctly decode text in the Windows-1255 (Hebrew) character set that the page is actually encoded in; ISO-8859-1, for example, has different characters at the same code points.
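To see what "different characters at the same code points" means in practice, here is a minimal sketch that decodes the same four bytes with two encodings. `MojibakeDemo` and `DecodeAs` are names made up for this example, and the provider registration assumes a modern .NET runtime (.NET Core 3.0 or later); on .NET Framework it is not needed.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: decodes the given bytes using the named encoding.
    public static string DecodeAs(byte[] bytes, string encodingName)
    {
        // Makes legacy code pages such as windows-1255 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }
}
```

For the bytes `F9 EC E5 ED`, `DecodeAs(bytes, "windows-1255")` yields the Hebrew word שלום, while `DecodeAs(bytes, "ISO-8859-1")` yields `ùìåí`. The bytes on the wire are identical; only the interpretation differs.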

Some C# implementations that try to do encoding detection from the meta HTML element are provided in Encoding trouble with HttpWebResponse. I cannot vouch for their correctness, so you'll have to try it at your own risk. I do know that the currently highest-voted answer actually re-issues the request when it encounters the meta tag, which is quite silly, because there is no guarantee that the second response will be the same as the first, and it's just a waste of bandwidth.

You can also assume that you know the encoding used for a certain site or page, and force that encoding:

    // YourFixedEncoding is a placeholder for the encoding you assume the site uses.
    using (Stream resStream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(resStream, YourFixedEncoding))
    {
        string content = reader.ReadToEnd();
    }

Or, for HttpClient:

    using (var client = new HttpClient())
    {
        var response = await client.GetAsync(url);
        // The stream is read from the response content, not from the client.
        var responseStream = await response.Content.ReadAsStreamAsync();
        // Code page 1255 is Windows-1255 (Hebrew).
        using (var fixedEncodingReader = new StreamReader(responseStream, Encoding.GetEncoding(1255)))
        {
            string responseString = fixedEncodingReader.ReadToEnd();
        }
    }

But assuming an encoding for a particular response, URL, or site is inherently unsafe. It is in no way guaranteed that this assumption will be correct every time.

CodeCaster
  • Thanks for the detailed answer. Allow me to annoy you once again, though: why is `charset: windows-1255` being ignored in the 2nd response ? What is this, a bug ?! If only HttpClient framework would relate to it, everything would (should) be fine !!! – Veverke Mar 31 '16 at 11:27
  • Because, as I've said before: **HTTP clients do not read HTML tags**. They only read HTTP response headers, which I've tried to explain in [my answer to your previous question](http://stackoverflow.com/a/36330581/266143). – CodeCaster Mar 31 '16 at 11:27
  • So bottom line: yes, .NET does not provide you a way to assuredly get a content property encoded. Neither .NET nor anyone else, since this is not possible. (I was not expecting this to be the answer at all, thus my reluctance...). You have your +1. – Veverke Mar 31 '16 at 11:50
  • I'm not saying it is impossible. It's just very hard to get right. See the link and the code at the end of my answer for some attempts that _may_ work for _most_ cases... – CodeCaster Mar 31 '16 at 11:53
  • Did not go over all your answers. Will do sometime later. (now the more respected SO users frown their eyes and whim "no, these guys never really want to learn what the real problem was!"...) Thank you for being insistent as well. I remember you from other posts, I honestly would not single out your *answering methodology* among the best I've seen, but at least you will stand up to make your point until proven wrong. – Veverke Mar 31 '16 at 11:57
  • Thanks for that, I guess. I like to explain the problem instead of dumping "fixed" code that _might_ help you further this instance, but will not be useful for others and/or hurt you or your project in the near future. – CodeCaster Mar 31 '16 at 11:58
  • I understand that and it has its virtues for sure, I myself think many users here have taken this too far. We are after all only trying to solve someone's problem. – Veverke Mar 31 '16 at 12:04
  • @Veverke this site does not exist to help one developer over their acute problem. It is supposed to be a library full of information that will not only help the original asker further, but also be of value for future visitors. I do not care that you currently have this problem, and I am not interested in helping you over this specific problem. I want to explain **why** you encounter this problem, and why it is not trivial to fix. I explained the core problem thoroughly, and linked you to some possible solutions that **may** help you, **if** you understand their implementation and drawbacks. – CodeCaster Mar 31 '16 at 13:34
  • @Veverke alternatively I could have dumped some code like _"Use this code to read a HTTP GET response as a Windows-1255 encoded string"_, but you would have learned nothing, and have come back for the first page that uses a different encoding and would have broken that code. – CodeCaster Mar 31 '16 at 13:38
  • We basically disagree on some fundamental approaches to life, I guess :). And, if you have dumped a working code - it would answer no question - this is the reason I posted the 2nd one - because @Haim770 here solved (oops, you hate this terminology) the problem - but not answered my question. See, one can live with a solution to a problem and still have a question - and a desire to learn. – Veverke Mar 31 '16 at 13:48
  • @Veverke how we utilize some website does not define our approach to life. And please don't put words in my mouth, nowhere am I saying that I don't want to solve problems. Your problem here (_"How can I properly read Windows-1255 encoded text from an HttpClient"_) was also not solved in comments; they merely provided a hint on where this encoding was actually specified. – CodeCaster Mar 31 '16 at 13:51
  • All this *approach* in handling user posts is not yours exclusively, it is a widespread trend one sees. I have nothing to do with SO's philosophy. My opinion ? People started losing by a lot contact with reality. The idea you depict is very utopic and amazing. I hope this is what happens, and you find fulfillment in you making the world a better place via SO. What I think. Just do not ask me (and others) to share and embrace such beliefs. – Veverke Mar 31 '16 at 13:52
  • @Veverke I honestly do not understand what else you would have expected of me. What do you think is wrong with my answer here, and what kind of answer do you want to have gotten? – CodeCaster Mar 31 '16 at 13:54
  • Come on... the problem posed by this post is - the response I get to this url is not properly encoded. Came this Haim770 and put me back on track in one or two comments - with no supernatural pretensions. A new question popped in my mind, and I posted another one. As simple as that. (yes) – Veverke Mar 31 '16 at 13:55
  • It's just how you come and deal with what's happening here (a guy raising a hand: "hey, you, who are more knowledgeable than me (and you definitely are) in this stuff, can you tell me what I am doing wrong?"). It's you (and other guys) taking it to places way beyond (in my judgement). Anyway, this is my opinion, but I have seen many think like you, so chances are that I am the one who is wrong. – Veverke Mar 31 '16 at 13:58
  • I again thank you for trying to help, just try to take it a little bit easier next time, I would suggest, were we friends. – Veverke Mar 31 '16 at 13:59
  • Well I'm very sorry for trying to thoroughly explain the problem you're encountering. Next time I'll dump a link in comments and let you figure it out for yourself, if that's what you're after. If not, I really don't know what you're getting at. – CodeCaster Mar 31 '16 at 14:05
  • @CodeCaster, the `<meta>` tag is not a "nasty workaround for not having to properly configure a web server". There's no way the web server will know the encoding of all the files it serves. See more on http://www.joelonsoftware.com/articles/Unicode.html – haim770 Mar 31 '16 at 14:05
  • @haim it _is_ a nasty workaround, because now your HTTP client, which doesn't have to know anything about HTML (or any other content type), will have to learn HTML in order to properly decode the payload as a string - because the server omits the encoding information. Maybe you're missing the sarcastic tone of that paragraph in the article that I also link to in my answer. There should be no need for the web server to know about every document: you can set a content-type per directory, and make sure your documents are stored in the proper encoding in the first place. – CodeCaster Mar 31 '16 at 14:09
  • @CodeCaster, The way I see it, it was not written in a sarcastic tone. It *is* a valid argument. Although the web server *should* include the `content-type` header, sometimes it *can't*. And for that, a web browser can make a very legitimate use of the `<meta>` tag. – haim770 Mar 31 '16 at 14:12
  • @haim yes, a **web browser**. We're talking about HTTP clients, which are not that. Also, Joel's **2003** article pretty clearly points out that **every** OS supports Unicode, and that it's silly to store non-ASCII data in anything other than Unicode. Let alone 13 years later, when no website alive should be using ANSI character sets anymore. That it's supported and possible to use that meta tag doesn't make it any less of a hack. – CodeCaster Mar 31 '16 at 14:13