
I am trying to use the Google Speech-to-Text API to do voice-to-voice translation (together with the Translation and Text-to-Speech APIs). I would like a person to speak into the microphone and have that speech transcribed to text. I used the streaming-audio tutorial from the Google documentation as a base for this method. I would also like the audio stream to stop when the person has stopped speaking.

Here is the modified method:

public static String streamingMicRecognize(String language) throws Exception {

        ResponseObserver<StreamingRecognizeResponse> responseObserver = null;
        try (SpeechClient client = SpeechClient.create()) {

            responseObserver =
                    new ResponseObserver<StreamingRecognizeResponse>() {
                ArrayList<StreamingRecognizeResponse> responses = new ArrayList<>();

                public void onStart(StreamController controller) {}

                public void onResponse(StreamingRecognizeResponse response) {
                    responses.add(response);
                }

                public void onComplete() {
                    SPEECH_TO_TEXT_ANSWER = "";
                    for (StreamingRecognizeResponse response : responses) {
                        StreamingRecognitionResult result = response.getResultsList().get(0);
                        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
                        System.out.printf("Transcript : %s\n", alternative.getTranscript());
                        SPEECH_TO_TEXT_ANSWER = SPEECH_TO_TEXT_ANSWER + alternative.getTranscript();
                    }
                }

                public void onError(Throwable t) {
                    System.out.println(t);
                }
            };

            ClientStream<StreamingRecognizeRequest> clientStream =
                    client.streamingRecognizeCallable().splitCall(responseObserver);

            RecognitionConfig recognitionConfig =
                    RecognitionConfig.newBuilder()
                    .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                    .setLanguageCode(language)
                    .setSampleRateHertz(16000)
                    .build();
            StreamingRecognitionConfig streamingRecognitionConfig =
                    StreamingRecognitionConfig.newBuilder().setConfig(recognitionConfig).build();

            StreamingRecognizeRequest request =
                    StreamingRecognizeRequest.newBuilder()
                    .setStreamingConfig(streamingRecognitionConfig)
                    .build(); // The first request in a streaming call has to be a config

            clientStream.send(request);
            // SampleRate:16000Hz, SampleSizeInBits: 16, Number of channels: 1, Signed: true,
            // bigEndian: false
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            DataLine.Info targetInfo =
                    new Info(
                            TargetDataLine.class,
                            audioFormat); // Set the system information to read from the microphone audio stream

            if (!AudioSystem.isLineSupported(targetInfo)) {
                System.out.println("Microphone not supported");
                System.exit(0);
            }
            // Target data line captures the audio stream the microphone produces.
            TargetDataLine targetDataLine = (TargetDataLine) AudioSystem.getLine(targetInfo);
            targetDataLine.open(audioFormat);
            targetDataLine.start();
            System.out.println("Start speaking");
            playMP3("beep-07.mp3");
            long startTime = System.currentTimeMillis();
            // Audio Input Stream
            AudioInputStream audio = new AudioInputStream(targetDataLine);
            long estimatedTime = 0, estimatedTimeStoppedSpeaking = 0, startStopSpeaking = 0;
            int currentSoundLevel = 0;
            boolean hasSpoken = false;
            while (true) {
                estimatedTime = System.currentTimeMillis() - startTime;
                byte[] data = new byte[6400];
                audio.read(data);

                currentSoundLevel = calculateRMSLevel(data);
                System.out.println(currentSoundLevel);

                if (currentSoundLevel > 20) {
                    estimatedTimeStoppedSpeaking = 0;
                    startStopSpeaking = 0;
                    hasSpoken = true;
                }
                else {
                    if (startStopSpeaking == 0) {
                        startStopSpeaking = System.currentTimeMillis();
                    }
                    estimatedTimeStoppedSpeaking = System.currentTimeMillis() - startStopSpeaking;
                }

                if ((estimatedTime > 15000) || (estimatedTimeStoppedSpeaking > 1000 && hasSpoken)) { // 15 seconds or stopped speaking for 1 second
                    playMP3("beep-07.mp3");
                    System.out.println("Stop speaking.");
                    targetDataLine.stop();
                    targetDataLine.drain();
                    targetDataLine.close();
                    break;
                }
                request =
                        StreamingRecognizeRequest.newBuilder()
                        .setAudioContent(ByteString.copyFrom(data))
                        .build();
                clientStream.send(request);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
        responseObserver.onComplete();
        String ans = SPEECH_TO_TEXT_ANSWER;
        return ans;
    }
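The `calculateRMSLevel` helper referenced above is not shown in the question. For context, a typical implementation for 16-bit little-endian mono PCM might look like the sketch below; this is an assumption, not the asker's actual code, and the final scaling divisor is arbitrary (chosen so that silence is near 0 and speech lands well above the threshold of 20 used in the loop):

```java
public class RmsLevel {
    // Computes an RMS "loudness" value from 16-bit little-endian mono PCM bytes.
    // Hypothetical reconstruction of the question's calculateRMSLevel helper.
    static int calculateRMSLevel(byte[] audioData) {
        int samples = audioData.length / 2;
        if (samples == 0) return 0;
        long sumOfSquares = 0;
        for (int i = 0; i < samples; i++) {
            // Assemble a signed 16-bit sample (little-endian byte order).
            int lo = audioData[2 * i] & 0xFF;       // low byte, unsigned
            int hi = audioData[2 * i + 1];          // high byte carries the sign
            int sample = (hi << 8) | lo;
            sumOfSquares += (long) sample * sample;
        }
        double rms = Math.sqrt((double) sumOfSquares / samples);
        // Arbitrary scale-down so quiet input reads ~0 and speech reads > 20.
        return (int) (rms / 100);
    }
}
```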

The output is supposed to be the transcribed text as a string. However, it is very inconsistent: most of the time it returns an empty string, but sometimes the program does work and returns the transcribed text.

I have also tried recording the audio separately while the program was running. Although the method returned an empty string, when I saved the separately recorded audio file and sent it directly through the API, it returned the correct transcribed text.
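To compare what the method actually sends against the separately recorded file, one option is to dump the exact bytes passed to `ByteString.copyFrom` into a WAV file and replay it. A minimal sketch, assuming the same 16 kHz / 16-bit / mono / little-endian format as the stream config (`DebugWavDump` is a hypothetical helper, not part of the question's code):

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;

public class DebugWavDump {
    // Accumulates every chunk sent to the API so it can be saved and replayed.
    private static final ByteArrayOutputStream captured = new ByteArrayOutputStream();

    // Call with the same (data, bytesRead) that is sent to the API.
    static void capture(byte[] data, int bytesRead) {
        captured.write(data, 0, bytesRead);
    }

    static int capturedBytes() {
        return captured.size();
    }

    // Writes the collected PCM bytes as a WAV file matching the stream config.
    static void saveWav(File out) throws IOException {
        AudioFormat fmt = new AudioFormat(16000, 16, 1, true, false);
        byte[] pcm = captured.toByteArray();
        AudioInputStream ais = new AudioInputStream(
                new ByteArrayInputStream(pcm), fmt, pcm.length / fmt.getFrameSize());
        AudioSystem.write(ais, AudioFileFormat.Type.WAVE, out);
    }
}
```

If the dumped file sounds wrong or is shorter than expected, the problem is on the capture side rather than in the API call.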

I do not understand why the program only works some of the time.

  • `audio.read(data)` is not guaranteed to fill the `data` array, as [the documentation](https://docs.oracle.com/en/java/javase/12/docs/api/java.desktop/javax/sound/sampled/AudioInputStream.html#read%28byte[]%29) states. You need to make use of the byte count returned by that method. – VGR Aug 19 '19 at 15:36
  • @VGR I am not quite sure I understand what you mean. What am I supposed to use the returned byte count for? To make sure it is 6400? And if it isn't 6400 then how would I fix it? – g123 Aug 19 '19 at 16:34
  • Yes. The method may read 6400 bytes, or it may read only 6300 bytes, or may read eight bytes. You need to write only as many bytes as the `read` method reported it read. You will need to pass the number of bytes read to your `calculateRMSLevel` method and to the ByteString.copyFrom method. – VGR Aug 19 '19 at 16:39
  • @VGR the function is always returning 6400... whether the transcription is working or not – g123 Aug 19 '19 at 18:49
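One way to act on VGR's point is to loop until the buffer is actually full (or the stream ends), so that a known number of valid bytes is passed to `calculateRMSLevel` and `ByteString.copyFrom`. A minimal sketch; `readFully` is a hypothetical helper, not part of the question's code:

```java
import java.io.IOException;
import java.io.InputStream;

public class ReadExact {
    // Fills buf completely (or until end of stream), looping because a single
    // read() call may legally return fewer bytes than requested.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) break; // end of stream
            total += n;
        }
        return total; // number of valid bytes in buf
    }
}
```

The returned count would then be threaded through, e.g. `ByteString.copyFrom(data, 0, bytesRead)` instead of `ByteString.copyFrom(data)`.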

0 Answers