Hey :) I'm trying really hard to make WebClient return UTF-8. But where sub should contain something like Ä, I get something more like an E instead, I think.

I gave a lot of workarounds a try, but none of them worked.

private string translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    WebClient wc = new WebClient();
    wc.Headers.Add(HttpRequestHeader.AcceptCharset, "UTF-8");
    wc.Encoding = Encoding.UTF8;
    var data = wc.DownloadData(url);
    var result = Encoding.UTF8.GetString(data);
    //string result = wc.DownloadString(url);
    int start = result.IndexOf("result_box");
    string sub = result.Substring(start);
    sub = sub.Substring(0, sub.IndexOf("</span>"));
    start = sub.LastIndexOf(">");
    sub = sub.Substring(start + 1);
    return sub;
}
Morteza Asadi
koin

1 Answer

Google simply ignores the encoding requested in the Accept-Charset header and returns the response in ISO-8859-1, as you can see from this shortened response:

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en
Content-Length: 64202

<!DOCTYPE html><html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
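
To make the effect of that charset mismatch concrete, here is a minimal sketch. The class and method names (`CharsetMismatchDemo`, `DecodeAs`) are hypothetical and only for illustration. It decodes the single byte that ISO-8859-1 uses for Ä, first with the charset the server declared and then with UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical demo type, not part of the question's code.
static class CharsetMismatchDemo
{
    // Decodes raw bytes with the encoding identified by its charset name.
    public static string DecodeAs(byte[] bytes, string charsetName)
    {
        return Encoding.GetEncoding(charsetName).GetString(bytes);
    }
}
```

Calling `CharsetMismatchDemo.DecodeAs(new byte[] { 0xC4 }, "ISO-8859-1")` gives back "Ä", because 0xC4 is that letter in ISO-8859-1. The same byte passed with "utf-8" is an incomplete multi-byte sequence, so the decoder substitutes the replacement character U+FFFD.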

Therefore, when you decode the response as UTF-8, you get invalid characters. If you just want to make it work quickly, I have found that when a User-Agent header is added to the request, Google returns the response in UTF-8, and you can leave the rest of the code unmodified:

private static string translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    WebClient wc = new WebClient();
    wc.Headers.Add(HttpRequestHeader.AcceptCharset, "utf-8");
    wc.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/55.0");
    wc.Encoding = Encoding.UTF8;
    string result = wc.DownloadString(url);
    int start = result.IndexOf("result_box");
    string sub = result.Substring(start);
    sub = sub.Substring(0, sub.IndexOf("</span>"));
    start = sub.LastIndexOf(">");
    sub = sub.Substring(start + 1);
    return sub;
}

A better solution is to detect the encoding used in the response and use it for decoding. WebClient does not have this detection built in, so you can either use the solution described here or use HttpClient instead, which does this for you automatically:

private static async Task<string> translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    using (var hc = new HttpClient())
    {
        var result = await hc.GetStringAsync(url).ConfigureAwait(false);
        int start = result.IndexOf("result_box");
        string sub = result.Substring(start);
        sub = sub.Substring(0, sub.IndexOf("</span>"));
        start = sub.LastIndexOf(">");
        sub = sub.Substring(start + 1);
        return sub;
    }
}
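
If you would rather stay with WebClient, the detection can also be sketched by hand: download the raw bytes, read the charset from the response's Content-Type header, and decode with that encoding. The helper below is a hypothetical illustration (`ResponseDecoder.Decode` is not a framework API), and falling back to UTF-8 when no usable charset is declared is an assumption of this sketch.

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper, shown only to illustrate the idea.
static class ResponseDecoder
{
    // Decodes a response body using the charset named in a Content-Type
    // header value such as "text/html; charset=ISO-8859-1".
    public static string Decode(byte[] body, string contentTypeHeader)
    {
        Encoding encoding = Encoding.UTF8; // assumed fallback

        if (!string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            try
            {
                string charset = new ContentType(contentTypeHeader).CharSet;
                if (!string.IsNullOrEmpty(charset))
                {
                    encoding = Encoding.GetEncoding(charset);
                }
            }
            catch (FormatException)
            {
                // Malformed header value: keep the fallback.
            }
            catch (ArgumentException)
            {
                // Charset name not known to this runtime: keep the fallback.
            }
        }

        return encoding.GetString(body);
    }
}
```

With WebClient this would be used roughly as `var data = wc.DownloadData(url);` followed by `var result = ResponseDecoder.Decode(data, wc.ResponseHeaders[HttpResponseHeader.ContentType]);`. Note that this only looks at the HTTP header, not at a `<meta>` tag inside the HTML.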

Also, please note that Google has a Translation API, which might be a better choice than parsing the translation out of an HTML page.

Ňuf