
I am trying to find the index of "Mauricio" in a string that I download from a website with `WebClient.DownloadString`. However, the website spells the name with a foreign character: Maurício. So I found some code elsewhere:

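    // Requires: using System.Globalization; using System.Linq; using System.Text;
    string ToASCII(string s)
    {
        // Decompose accented characters (FormD), then drop the combining marks.
        return String.Join("",
            s.Normalize(NormalizationForm.FormD)
             .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
    }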

that converts foreign characters. I have tested the code and it works. So the problem I have is that when I download the string, it downloads as MaurA-cio. I have tried both

    wc.Encoding = System.Text.Encoding.UTF8;
    wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

Neither stops it from downloading as MaurA-cio.

(Also, I cannot change the search as I am getting the search term from a list).

What else can I try? Thanks

gabagool
  • Shouldn't you call the normalize outside the join? – Eric Nov 16 '14 at 02:48
  • I don't think so. As is, it properly converts Maurício to Mauricio – gabagool Nov 16 '14 at 02:50
  • Correct. Right now using downloadstring, foreign characters do not download properly. í becomes A- – gabagool Nov 16 '14 at 03:00
  • Normalization doesn't convert characters not representable by ASCII into characters that are, and you even seem to be saying that the `ToASCII` method _doesn't_ work for at least one string ("Maurício"). What is it you are actually trying to accomplish? Why did you introduce that method `ToASCII` to your code in the first place (since it doesn't really convert strings to ASCII)? If you _are_ trying to convert strings to ASCII, what do you expect to do with the string "Maurício", given that it cannot be represented in ASCII? – Peter Duniho Nov 16 '14 at 03:43
  • the comment to [this answer](http://stackoverflow.com/a/4716548/815938) suggested this might be a bug in .NET 3.5; – kennyzx Nov 16 '14 at 03:54

2 Answers

var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };

var json = client.DownloadString(url);

This works for any character, provided the server actually sends the response as UTF-8.
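To illustrate why the decoding choice matters, here is a minimal sketch (not part of the original answer) that reproduces the garbling without any network access. It assumes the page is served as UTF-8 and was decoded as ISO-8859-1; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Decodes the UTF-8 bytes of `text` with the wrong encoding (ISO-8859-1).
    public static string DecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    // Decodes the same bytes with the matching encoding (UTF-8).
    public static string RoundTripUtf8(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    static void Main()
    {
        string name = "Maur\u00EDcio"; // "Maurício" with a precomposed í

        // í is two bytes in UTF-8 (0xC3 0xAD); ISO-8859-1 turns each byte
        // into its own character, so the 8-character name becomes 9 characters.
        string garbled = DecodeUtf8AsLatin1(name);
        Console.WriteLine(garbled.Length);

        // With the matching encoding the name survives intact.
        Console.WriteLine(RoundTripUtf8(name) == name);
    }
}
```

The two characters that replace í are U+00C3 (Ã) and U+00AD (a soft hyphen), which is consistent with the "A-" the question reports.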

Sanket Patel

DownloadString doesn't look at HTTP response headers. It uses the previously set WebClient.Encoding property. If you have to use it, get the headers first:

    // Call DownloadString twice: the first request is only there to read the response headers
    // (or just do a HEAD instead, see http://stackoverflow.com/questions/3268926/head-with-webclient).
    webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
    var contentType = webClient.ResponseHeaders["Content-Type"];
    var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;

    // The second request decodes the body with the charset the server declared.
    webClient.Encoding = Encoding.GetEncoding(charset);
    var s = webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
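If the second request is a concern, one possible variation (a sketch, not something the answer above prescribes) is to fetch the raw bytes once with `DownloadData` and decode them yourself. The helper names `EncodingFromContentType` and `DownloadText` are hypothetical, and falling back to UTF-8 when no usable charset is declared is an assumption.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetAwareDownload
{
    // Returns the encoding named in a Content-Type value, or the fallback
    // when the header is missing, has no charset, or names an unknown one.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType)) return fallback;

        Match m = Regex.Match(contentType, @"charset\s*=\s*""?([^;""\s]+)",
                              RegexOptions.IgnoreCase);
        if (!m.Success) return fallback;

        try { return Encoding.GetEncoding(m.Groups[1].Value); }
        catch (ArgumentException) { return fallback; }
    }

    // Downloads the body once as bytes, then decodes it with the declared charset.
    public static string DownloadText(string url)
    {
        using (var client = new WebClient())
        {
            byte[] body = client.DownloadData(url);
            string contentType = client.ResponseHeaders["Content-Type"];
            return EncodingFromContentType(contentType, Encoding.UTF8).GetString(body);
        }
    }
}
```

Usage would look like `var s = CharsetAwareDownload.DownloadText(url);`. Unlike the snippet above, this version does not throw when the header carries no `charset` parameter.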

BTW--Unicode doesn't define "foreign" characters. From Maurício's perspective, "Mauricio" would be the foreign spelling of his name.

Tom Blodget