
I've been chasing my tail for two days trying to figure out the best way to send the <Buffer ... > object generated by Google's Text-to-Speech service from my express-api to my React app. I've come across tons of opinionated resources that point me in different directions and only "solve" isolated parts of the bigger process. At the end of it all I've learned a lot more about ArrayBuffer, Buffer, binary arrays, etc., yet I still feel just as lost as before when it comes to the actual implementation.

At its simplest, all I aim to do is send one or more strings of text to TTS, generate the audio files, send those files from my express-api to my React client, and then automatically play the audio in the background in the browser when appropriate.

I am successfully triggering Google's TTS to generate the audio files. It responds with a <Buffer ...> representing the binary data of the file. That buffer arrives at my express-api endpoint; from there I'm not sure what I should do with it ...

Then, once it's on the browser:

  • do I use an <audio /> tag?
  • should I convert it to something else? (a rough sketch of what I'm picturing is below)
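
The most bare-bones thing I can picture is pointing an audio element straight at the API (rough, untested sketch; it would only work if the endpoint served playable bytes, which I'm not sure mine does):

// purely hypothetical -- not what I have today
const playAudio = () => {
  const audio = new Audio("http://localhost:8000/generate-story-support");
  audio.play().catch((err) => console.log("could not autoplay:", err));
};

But new Audio(url) fetches with a GET, and my endpoint is a POST, so I don't know whether that even applies here.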

I suppose the real problem is that searching for answers produces information overload: lots of different answers written over the past 10 years using different approaches and technologies. I can't tell where one approach ends and the next begins, what's bad practice, what's best practice, and, more importantly, what actually fits my case. I could really use some guidance here.

Synthesise function from Google

// client is a TextToSpeechClient instance from @google-cloud/text-to-speech
// returns: <Buffer ff f3 44 c4 ... />
const synthesizeSentence = async (sentence) => {
  const request = {
    input: { text: sentence },
    voice: { languageCode: "en-US", ssmlGender: "NEUTRAL" },
    audioConfig: { audioEncoding: "MP3" },
  };

  const response = await client.synthesizeSpeech(request);
  return response[0].audioContent;
};

Current shape of the express-api POST endpoint:

app.post("/generate-story-support", async (req, res) => {
  try {
    // ? generating the post here for simplicity, eventually the client
    // ? would dictate the sentences to send ...
    const ttsResponse: any = await axios.post("http://localhost:8060/", {
      sentences: SAMPLE_SENTENCES,
    });

    // a resource said to send the response as a string and then convert
    // it on the client to an ArrayBuffer -- no idea if this is good practice
    // (a sketch of the binary alternative I'm weighing is below this snippet)
    return res.status(201).send(ttsResponse.data[0].data.toString());
  } catch (error) {
    console.log("error", error);
    return res.status(400).send(`Error: ${error}`);
  }
});
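
The binary alternative I'm weighing looks roughly like this (untested sketch, assuming the service at :8060 really does return the JSON-serialized buffer in data[0].data as implied above):

app.post("/generate-story-support", async (req, res) => {
  try {
    const ttsResponse: any = await axios.post("http://localhost:8060/", {
      sentences: SAMPLE_SENTENCES,
    });

    // data[0] is the serialized Buffer ({ type: "Buffer", data: [...] }),
    // so rebuild the bytes instead of stringifying them
    const audioBuffer = Buffer.from(ttsResponse.data[0].data);

    res.set("Content-Type", "audio/mpeg"); // the synthesize request asked for MP3
    return res.status(200).send(audioBuffer);
  } catch (error) {
    console.log("error", error);
    return res.status(400).send(`Error: ${error}`);
  }
});

If that's sound, the client could presumably skip the TextEncoder step entirely and just consume the bytes.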

React client (pieced together from another SO post):

  useEffect(() => {
    const fetchData = async () => {
      const data = await axios.post(
        "http://localhost:8000/generate-story-support"
      );
      // converting it to an ArrayBuffer per another SO post
      const encoder = new TextEncoder();
      const encodedData = encoder.encode(data.data);
      setAudio(encodedData);
      return data.data;
    };

    fetchData();
  }, []);

  // no idea what to do from here, if this is even the right path :/ 
kevin

2 Answers


Next.js client component:

"use client";

import React, { useState } from "react";

const Text2Speech = () => {
  const [isFetching, setIsFetching] = useState(false);
  const audioRef = React.useRef(new Audio());

  const fetchAudio = () => {
    setIsFetching(true);
    fetch("http://localhost:3000/api/interview/response")
      .then((response) => {
        if (!response.ok) {
          throw new Error("Network response was not ok");
        }
        return response.json(); 
      })
      .then((json) => {
        // json.data is the JSON-serialized Buffer: { type: "Buffer", data: [...] }
        const bufferData = new Uint8Array(json.data.data);
        // the server synthesizes MP3 (audioEncoding: 2), so "audio/mpeg" is the fitting type
        const blob = new Blob([bufferData], { type: "audio/mpeg" });
        const objectURL = URL.createObjectURL(blob);
        audioRef.current.src = objectURL;
        audioRef.current.play();
        setIsFetching(false);
      })
      .catch((error) => {
        console.error("Error fetching audio:", error);
        setIsFetching(false);
      });
  };

  return (
    <div>
      <button onClick={fetchAudio} disabled={isFetching}>
        {isFetching ? "Loading..." : "Play"}
      </button>
    </div>
  );
};

export default Text2Speech;

Server function that generates the track:

import textToSpeech from "@google-cloud/text-to-speech";

const client = new textToSpeech.TextToSpeechClient();

export async function convertTextToSpeech() {
  // "Hello. How are you doing, what are you up to?" (sample Russian text)
  const text = "Привет. Как дела, чем занимаешься? ";
  const request = {
    input: {
      text,
    },
    voice: { languageCode: "ru", ssmlGender: undefined },
    audioConfig: { audioEncoding: 2 }, // 2 = MP3 in the AudioEncoding enum
  };
  const [response] = await client.synthesizeSpeech(request);

  return response.audioContent;
}

API route responding to the client with a buffer:

import { NextResponse } from "next/server";
import { convertTextToSpeech } from "./text-to-speech";

export async function GET(request: Request) {
  const audio = await convertTextToSpeech();
  return NextResponse.json({ data: audio });
}
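
Note that NextResponse.json() serializes the Buffer as { type: "Buffer", data: [ ...bytes ] }, which is why the client reads json.data.data and why the payload gets fairly large. If you'd rather skip JSON, here is a sketch of returning the raw bytes instead (assuming audioContent comes back as a Uint8Array):

export async function GET(request: Request) {
  const audio = await convertTextToSpeech();
  // respond with the MP3 bytes directly instead of JSON-wrapping them
  return new NextResponse(audio, {
    headers: { "Content-Type": "audio/mpeg" },
  });
}

The client can then use response.blob() and URL.createObjectURL(blob) directly, with no Uint8Array step.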
Pavostik

For future eyes,

I ended up leveraging the browser's Web Audio API. The flow was: take the Buffer from Google's TTS, pass it through my API to the client, convert it there into an ArrayBuffer, and let the Web Audio API decode it onto a source node within a new AudioContext. If none of that made sense, start with the Web Audio API documentation (MDN's guide is a really great resource and shows how to handle a variety of cases). A rough sketch of the client-side portion follows.
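
A minimal sketch of that client-side flow, assuming the /generate-story-support endpoint from my question now responds with the raw MP3 bytes (playStory is just a placeholder name):

// fetch the audio bytes, decode them with the Web Audio API, and play them
const playStory = async () => {
  const response = await fetch("http://localhost:8000/generate-story-support", {
    method: "POST",
  });
  const arrayBuffer = await response.arrayBuffer();

  const audioContext = new AudioContext();
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start(); // plays without needing an <audio /> element
};

One gotcha: browsers generally won't let an AudioContext produce sound until there has been a user gesture, so trigger this from a click (or resume() the context) rather than a bare useEffect.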

kevin