Google simply ignores the encoding sent in the AcceptCharset header and returns the response in ISO-8859-1, as you can see from the shortened response:
HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en
Content-Length: 64202
<!DOCTYPE html><html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
Therefore, when you decode the response as UTF-8, you get invalid characters. If you just want to make it work quickly, I have found that when a User-Agent header is added to the request, Google returns the response in UTF-8 and you can leave the rest of the code unmodified:
private static string translate(string input, string languagePair)
{
    // Escape the parameters so spaces, '&' and non-ASCII text survive in the query string.
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}",
        Uri.EscapeDataString(input), Uri.EscapeDataString(languagePair));
    using (WebClient wc = new WebClient())
    {
        wc.Headers.Add(HttpRequestHeader.AcceptCharset, "utf-8");
        // With a browser-like User-Agent, Google answers in UTF-8.
        wc.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/55.0");
        wc.Encoding = Encoding.UTF8;
        string result = wc.DownloadString(url);
        // Cut from "result_box" to the first "</span>", then keep the text after the last '>'.
        int start = result.IndexOf("result_box");
        string sub = result.Substring(start);
        sub = sub.Substring(0, sub.IndexOf("</span>"));
        start = sub.LastIndexOf(">");
        sub = sub.Substring(start + 1);
        return sub;
    }
}
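To see why the UTF-8 decoding produced invalid characters in the first place, here is a minimal sketch that reproduces the mismatch locally. The class and method names (MojibakeDemo, DecodeLatin1BytesAsUtf8) are made up for illustration and are not part of the original code:

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes the text as ISO-8859-1 (what the server sent) and then
    // decodes those bytes as UTF-8 (what wc.Encoding = Encoding.UTF8 does).
    public static string DecodeLatin1BytesAsUtf8(string text)
    {
        byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(text);
        return Encoding.UTF8.GetString(latin1Bytes);
    }
}
```

Plain ASCII text survives the round trip unchanged, but a character such as 'é' is a single byte (0xE9) in ISO-8859-1 that is not valid UTF-8 on its own, so the decoder substitutes the replacement character U+FFFD for it.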
A better solution is to detect the encoding used in the response and use it for decoding. WebClient does not have this detection built in, so you can either use the solution described here or use HttpClient instead, which does this for you automatically (it decodes using the charset given in the response's Content-Type header):
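private static async Task<string> translate(string input, string languagePair)
{
    // Escape the parameters so spaces, '&' and non-ASCII text survive in the query string.
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}",
        Uri.EscapeDataString(input), Uri.EscapeDataString(languagePair));
    using (var hc = new HttpClient())
    {
        // GetStringAsync decodes the body with the charset the server declares.
        var result = await hc.GetStringAsync(url).ConfigureAwait(false);
        // Cut from "result_box" to the first "</span>", then keep the text after the last '>'.
        int start = result.IndexOf("result_box");
        string sub = result.Substring(start);
        sub = sub.Substring(0, sub.IndexOf("</span>"));
        start = sub.LastIndexOf(">");
        sub = sub.Substring(start + 1);
        return sub;
    }
}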
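If you would rather stay with WebClient, the detection step mentioned above can be sketched roughly as follows. This is only an illustration under my own assumptions: the class and method names (CharsetDecoding, Decode) are hypothetical, and falling back to UTF-8 when no usable charset is declared is my choice, not something Google or WebClient prescribes:

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

static class CharsetDecoding
{
    // Decodes a response body using the charset named in its Content-Type
    // header value. Falls back to UTF-8 (an assumption of this sketch) when
    // the header is missing, unparsable, or names an unknown charset.
    public static string Decode(byte[] body, string contentType)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(contentType) &&
            MediaTypeHeaderValue.TryParse(contentType, out MediaTypeHeaderValue parsed) &&
            !string.IsNullOrEmpty(parsed.CharSet))
        {
            try
            {
                // Some servers quote the charset value, so strip the quotes.
                encoding = Encoding.GetEncoding(parsed.CharSet.Trim('"'));
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(body);
    }
}
```

With WebClient you would fetch the raw bytes with wc.DownloadData(url) instead of wc.DownloadString(url), and pass them to Decode together with wc.ResponseHeaders[HttpResponseHeader.ContentType].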
Also, please note that Google has a Translation API, which might be a better choice than parsing the translation out of an HTML page.