
I'm trying to pull an audio file from Google's text-to-speech endpoint. Basically, you take the base link and append whatever you want spoken to the end of it. The code below works just fine for English, so I think the problem must be in how the Chinese characters are encoded in the request. Here's what I've got:

String text = "text to be spoken";
public static final String AUDIO_CHINESE= "http://www.translate.google.com/translate_tts?tl=zh&q=";
public static final String AUDIO_ENGLISH = "http://www.translate.google.com/translate_tts?tl=en&q=";

URL url = new URL(AUDIO_ENGLISH + text);

urlConnection = (HttpURLConnection) url.openConnection();
urlConnection.setRequestMethod("GET");
urlConnection.setRequestProperty("Accept-Charset", Variables.UTF_8);

if (urlConnection.getResponseCode() ==200) {
     //get byte array in response
     in = new DataInputStream(urlConnection.getInputStream());
} else {
     in = new DataInputStream(urlConnection.getErrorStream());
}
//use commons io
byte[] bytes = IOUtils.toByteArray(in);

in.close();
urlConnection.disconnect();

return bytes;

When I try this with Chinese characters, though, it returns something that I can't get to play in the MediaPlayer (I suspect it's not a proper audio file, as the vast majority of the bytes are '85'). So I've tried both

String chText = "你好";
URL url = new URL(AUDIO_CHINESE + URLEncoder.encode(chText, "UTF-8"));

and

URL url = new URL(AUDIO_CHINESE + Uri.encode(chText, "UTF-8"));

and then adding

urlConnection.setRequestProperty("content-type", "application/x-www-form-urlencoded; charset=UTF-8");

to the request header. This just made it worse, though, because now it doesn't even return a 200 code, instead stating "FileNotFound" in logcat.

So on a whim, I went back and tried the URL/Uri encoding with the English text, and now the English won't return a valid result either. Not sure what's going on here: the raw url in the debugger works fine if I copy and paste into Chrome, but for some reason the urlConnection just doesn't work. Feel like I'm missing something obvious.

EDIT

Fiddling with it some more has revealed no answer, just more confusion (and exasperation). For some reason, when the request is sent over HttpURLConnection, the Google TTS service appears to read the UTF-8 percent-encoded text as UTF-16, at least as far as I can tell. For example, the character "維" (wei2) encodes to %E7%B6%AD, but if you pass it through the connection, you get a file that pronounces "see" (the letter "ç", to be precise).

ç, as it turns out, is 0x00E7 in UTF-16 (its UTF-8 percent-encoded form is %C3%A7), which matches the first byte of %E7%B6%AD. I have no idea why this happens from Java, because appending the same percent-encoded sequence to the end of the link in any browser works properly. Thus far, I have tried various ways of getting the TTS service to read %E7%B6%AD as a whole, without much success.
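The byte-level observation can be reproduced locally. This is only a sketch of one possible reading, not something the service documents: it assumes each byte of the UTF-8 sequence is being decoded as its own single-byte character (ISO-8859-1 here), which would turn the first byte 0xE7 of "維" into "ç". The class and method names are made up for illustration, and the character is written as a Unicode escape (\u7dad is "維") to keep the source ASCII-only.

```java
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // Encode the text as UTF-8, then decode those same bytes as ISO-8859-1,
    // so every byte becomes one character. This mimics the suspected misreading.
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String misread = decodeUtf8BytesAsLatin1("\u7dad");
        // Print each resulting character as a code point for inspection.
        for (char c : misread.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The first code point printed is U+00E7 (ç), which is at least consistent with the pronunciation described above.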

EDIT2

Solution to my problem found! See below for answer. The problem wasn't in the encoding, it was in the parsing on Google's end. Have edited the title accordingly. Cheers!

Matter Cat
  • Are you sure you're not entering the `else` part of your response code check? Maybe you're trying to play the contents of an error message. Try to add some logging to see the actual response headers and body. In addition, try to log `URL.toString()` after you've constructed the URL for each of your attempts and paste that into a browser to see what happens. – Zoltán Jan 27 '15 at 09:30
  • You should also verify that the response type is `audio/mpeg`. – Zoltán Jan 27 '15 at 09:32
  • Just checked: all response types are audio/mpeg. The top code alone with a Chinese string enters the 200 code section, while the URL/Uri encoding stuff throws me into the error sections. Tried url.toString() with un-encoded text, and the resulting url http://www.translate.google.com/translate_tts?tl=zh&q=由代表物體、抽象事物 works just fine. – Matter Cat Jan 27 '15 at 09:40
  • The problem most likely is URL encoding. Even though you are getting `translate.google.com/translate_tts?tl=zh&q=由代表物體、抽象事物` as a result of `URL.toString()`, I don't think it's a valid escaped URL; it's just that the browser knows how to escape it. You should take a look at the answers to [this question](http://stackoverflow.com/questions/724043/http-url-address-encoding-in-java). – Zoltán Jan 27 '15 at 10:04
  • In Firefox debugger, after right-clicking on the request in the network tab, selecting `copy as cURL`, I got this URL: `http://translate.google.com/translate_tts?tl=zh&q=%E7%94%B1%E4%BB%A3%E8%A1%A8%E7%89%A9%E9%AB%94%E3%80%81%E6%8A%BD%E8%B1%A1%E4%BA%8B%E7%89%A9` – Zoltán Jan 27 '15 at 10:08
  • Okay, popping in that URL gives me java.io.IOException: Invalid % sequence: %E‌ in query at index 77: http://translate.google.com/translate_tts?tl=zh&q=%E7%94%B1%E4%BB%A3%E8%A1%A8%E‌​7%89%A9%E9%AB%94%E3%80%81%E6%8A%BD%E8%B1%A1%E4%BA%8B%E7%89%A9 – Matter Cat Jan 27 '15 at 10:17
  • Why are you adding the line with `urlConnection.setRequestProperty`? IIRC `application/x-www-form-urlencoded` is for POST request only and you're just using GET request, right? – TactMayers Jan 28 '15 at 07:23
  • Yeah, I'm just using GET. Leaving it in or taking it out makes no difference, though; either way it still doesn't work. – Matter Cat Jan 28 '15 at 07:57
  • Try diff "encode" with " utf-8;$charSet" that tells exact what to use for chinese . – Robert Rowntree Jan 28 '15 at 08:37
  • I'm afraid I'm not sure what you mean. Do you mean URL url = new URL(AUDIO_CHINESE + URLEncoder.encode(chText, "UTF-8$charset")); or something else? Or set something in the urlConnection? – Matter Cat Jan 28 '15 at 08:42
  • I don't think the problem is in the encoding of the URL itself: when I replace the text to be sent with its unicode version, I get the same result either way. I think it's how the TTS engine/Java is processing the input it receives. Either that or something's getting garbled along the way. – Matter Cat Jan 28 '15 at 08:43

1 Answer


So, as it turns out, the problem in the end wasn't the encoding at all; it was the processing on Google's end. To get the service to correctly recognize UTF-8, you need to use this link http://www.translate.google.com/translate_tts?ie=utf-8&tl=zh-cn&q= instead of the one above. Note the ie=utf-8 added to the query parameters. So you can just call URLEncoder.encode("你好嗎", "UTF-8"), append the result to the link, and send it up as per usual. Whew!
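For reference, here is a minimal sketch of building that request URL. The base link and the ie=utf-8 parameter come from the answer above; the class name, the method name, and the choice to wrap the checked exception are assumptions made for illustration. The sample text is written with Unicode escapes (\u4f60\u597d\u55ce is "你好嗎") to keep the source ASCII-only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class TtsUrlBuilder {
    // Endpoint from the answer; ie=utf-8 tells the service how to decode q.
    static final String TTS_BASE =
            "http://www.translate.google.com/translate_tts?ie=utf-8";

    // Percent-encode the text as UTF-8 and append it as the q parameter.
    static String buildUrl(String languageCode, String text) {
        try {
            return TTS_BASE + "&tl=" + languageCode
                    + "&q=" + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("zh-cn", "\u4f60\u597d\u55ce"));
    }
}
```

The resulting string can be passed to new URL(...) and fetched with the same HttpURLConnection code as in the question.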

Matter Cat
  • Great answer thanks! I didn't think there'd be even one other person in the world with such a specific problem :) Out of interest, how did you arrive at your solution? – Alveoli Apr 13 '15 at 12:35
  • A lot of pain, tears, googling, trial and error, ritual sacrifices, etc. :P Such is life of an undocumented API. – Matter Cat Apr 14 '15 at 00:58