I want to translate a string into various languages with Google Translate, without using the API, in C#. This is my code:

public string TranslateWithGoogle(string input, string languagePair)
{
    try
    {
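        // Escape both values so spaces and the '|' separator survive in the query string.
        string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", Uri.EscapeDataString(input), Uri.EscapeDataString(languagePair));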
        WebClient webClient = new WebClient();
        webClient.Encoding = System.Text.Encoding.Default;
        string result = webClient.DownloadString(url);
        result = result.Substring(result.IndexOf("<span title=\"") + "<span title=\"".Length);
        result = result.Substring(result.IndexOf(">") + 1);
        result = result.Substring(0, result.IndexOf("</span>"));
        return result.Trim();
    }
    catch (Exception exc)
    {
        MessageBox.Show(exc.ToString());
        return string.Empty;
    }
        
}

To compare the C# output with what the browser shows, I use this test code:

string strSource_String = "Debug offline mode";
string strSource_Language = "en";
string str_It = TranslateWithGoogle(strSource_String, strSource_Language+"|it");
string str_Fr = TranslateWithGoogle(strSource_String, strSource_Language + "|fr");
string str_De = TranslateWithGoogle(strSource_String, strSource_Language + "|de");
string str_Ru = TranslateWithGoogle(strSource_String, strSource_Language + "|ru");
string str_Bg = TranslateWithGoogle(strSource_String, strSource_Language + "|bg");
string str_Cz = TranslateWithGoogle(strSource_String, strSource_Language + "|cz");
string str_Pl = TranslateWithGoogle(strSource_String, strSource_Language + "|pl");

The results from C# and from the browser are:

IT

C#: "Esegui il debug in modalità offline"

Browser: "Esegui il debug in modalità offline"

OK! The à character is also correct.

FR

C#: "Déboguer le mode hors connexion"

Browser: "Déboguer le mode hors connexion"

OK! The é character is also correct.

Russian

C#: "Ðåæèì îòëàäêè â àâòîíîìíîì ðåæèìå"

Browser: "Режим отладки в автономном режиме"

Wrong :-(

I get the same problem with Bulgarian and Czech. I have tried the other options for webClient.Encoding in place of System.Text.Encoding.Default, but none of them helped.

Thanks for helping

Patrick

  • Also consider using the HTML Agility Pack (https://stackoverflow.com/questions/846994/how-to-use-html-agility-pack) to do your HTML parsing. The way you are doing it now is pretty odd. – mjwills Jun 21 '18 at 08:01
  • with UTF8 nothing works not even à or è – Patrick Jun 21 '18 at 08:04
  • If you check the header section of the returned HTML you will see that it uses charset _"[windows-1251](https://en.wikipedia.org/wiki/Windows-1251)"_ - which is specifically for the Cyrillic characters. You need to set the encoding for that. – PaulF Jun 21 '18 at 08:27
  • Sounds fair!! Ok but how to do that? I have set webClient.Encoding = System.Text.Encoding.UTF8; and the result in fact is " – Patrick Jun 21 '18 at 08:37
  • One way would be to check the charset after the first read, if it is not the default set the correct encoding & download again. If you do set the correct encoding you do get the Cyrillic characters. After first download insert _"if (result.Contains("windows-1251")) { webClient.Encoding = System.Text.Encoding.GetEncoding("windows-1251"); result = webClient.DownloadString(url); }"_ for example – PaulF Jun 21 '18 at 08:49
  • Ok but this is not the matter. I am convinced that this might work but despite my setting UTF8 that is windows 1251 I still get wrong chars... any idea on how to solve? – Patrick Jun 21 '18 at 09:00
  • UTF-8 is more closely related to windows-1252 & generally (though not always) they are interchangeable - but windows-1251 is NOT similar which is why you get the wrong characters when using UTF-8 – PaulF Jun 21 '18 at 09:10
  • Thanks soo... what to do to use 1252? – Patrick Jun 21 '18 at 10:13
  • See my comment above - it is a quick & dirty check for the encoding, you may want to modify it to ensure that the _"windows-1251"_ is in the header section - but otherwise I have checked it & it works for your code for different languages. – PaulF Jun 21 '18 at 10:26
  • Wonderful!! Working with encoding = Encoding.GetEncoding("windows-1251"); Why not posting it as an answer? – Patrick Jun 21 '18 at 10:38

1 Answer

If you check the head section of the returned HTML, you will see that it declares the charset "windows-1251", a code page designed for Cyrillic characters. The downloaded bytes have to be decoded with that encoding, so you need to set webClient.Encoding accordingly.
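To see why the Russian output comes back as "Ðåæèì" rather than as random noise, here is a minimal sketch that reproduces the garbling. It assumes that Encoding.Default on the asker's machine was windows-1252 (typical for a Western European Windows setup, but not stated in the question) and that the code runs on .NET 5 or later, where legacy code pages have to be registered first; on .NET Framework the RegisterProvider line can be dropped. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes text as windows-1251 (what the server sends for Russian)
    // and decodes the same bytes as windows-1252 (the assumed Encoding.Default).
    public static string DecodeWithWrongCodePage(string text)
    {
        // Assumption: .NET 5+; legacy code pages are not available until registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] bytes = Encoding.GetEncoding("windows-1251").GetBytes(text);
        return Encoding.GetEncoding("windows-1252").GetString(bytes);
    }
}
```

Calling MojibakeDemo.DecodeWithWrongCodePage("Режим") gives "Ðåæèì", the first word of the C# result in the question. Each Cyrillic letter is a single byte in windows-1251, and that same byte value maps to an accented Latin letter in windows-1252.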

There may be better ways to get the charset before decoding the page, but if you are happy to download the page twice, you can check which charset the first download declares and, if it is "windows-1251" (or "ISO-8859-2"), change the encoding and download again.

Something like:

string result = webClient.DownloadString(url);
if (result.Contains("windows-1251"))
{
  webClient.Encoding = System.Text.Encoding.GetEncoding("windows-1251");
  result = webClient.DownloadString(url);
}
else if (result.Contains("ISO-8859-2"))
{
  webClient.Encoding = System.Text.Encoding.GetEncoding("ISO-8859-2");
  result = webClient.DownloadString(url);
}

This is a quick and dirty check: it matches the charset name anywhere in the page, so you may want to modify it to ensure that "windows-1251" actually appears in the header section.
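As one possible way to avoid the second download, the page can be fetched as raw bytes and decoded once, after looking up the declared charset. The sketch below is only an illustration under stated assumptions: CharsetSniffer, DetectEncoding and Decode are hypothetical names, the 2048-byte search window and the UTF-8 fallback are arbitrary choices, and the RegisterProvider call assumes .NET 5 or later (it can be dropped on .NET Framework).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Returns the encoding named by "charset=..." in the Content-Type header value,
    // or failing that in the first 2048 bytes of the page; otherwise the fallback.
    public static Encoding DetectEncoding(string contentType, byte[] body, Encoding fallback)
    {
        // Assumption: .NET 5+; legacy code pages are not available until registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Charset names are plain ASCII, so an ASCII view of the page start is enough.
        string head = Encoding.ASCII.GetString(body, 0, Math.Min(body.Length, 2048));

        foreach (string candidate in new[] { contentType ?? string.Empty, head })
        {
            Match m = Regex.Match(candidate, @"charset\s*=\s*[""']?([\w\-]+)", RegexOptions.IgnoreCase);
            if (!m.Success) continue;
            try
            {
                return Encoding.GetEncoding(m.Groups[1].Value);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: try the next candidate.
            }
        }
        return fallback;
    }

    // Decodes the downloaded bytes with the detected encoding (UTF-8 if none is declared).
    public static string Decode(string contentType, byte[] body)
    {
        return DetectEncoding(contentType, body, Encoding.UTF8).GetString(body);
    }
}
```

With the webClient from the question, this could be used as byte[] body = webClient.DownloadData(url); followed by string result = CharsetSniffer.Decode(webClient.ResponseHeaders["Content-Type"], body);. The rest of the Substring parsing would stay unchanged.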

PaulF
  • I did notice that the Czech version did not translate at all - not matching the _" – PaulF Jun 21 '18 at 11:12
  • Yes that is what I did, just not finding the correct char set – Patrick Jun 21 '18 at 11:25
  • What I found was I needed to use "|cs" rather than "|cz" to get the Czech translation to work & it used ISO-8859-2 encoding. – PaulF Jun 21 '18 at 11:28
  • Looking at Google translate - I see that it still uses "cs" which was the code for the former country of "Czechoslovakia" which no longer exists & "cs" is officially no longer used. – PaulF Jun 21 '18 at 11:46