I'm using a text-to-speech system to send the spoken version of some text to a websocket using the Twilio Connect functionality. According to the Twilio documentation, the audio must be Base64-encoded $\mu$-law at 8 kHz. The TTS system I use generates a float array at 24 kHz. I downsampled it to 8 kHz, wrote the result to a .wav file, and verified that it sounds good.
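from scipy.io.wavfile import write as write_wav  # assumption: write_wav is SciPy's wavfile.write
from scipy.signal import resample

SAMPLE_RATE = 24000        # rate of the TTS output
PHONE_SAMPLE_RATE = 8000   # rate Twilio expects

# Generate audio from text
audio_array = generate_audio(text_prompt)
# Resample from 24000 to 8000
audio_resampled = resample(audio_array, len(audio_array) * PHONE_SAMPLE_RATE // SAMPLE_RATE)
with open('out.wav', 'wb') as f:
    write_wav(f, PHONE_SAMPLE_RATE, audio_resampled)
# verified that 'out.wav' plays on my computer and sounds good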
Next, I do mu-law encoding using the mu_compress function available in librosa:
import numpy as np
from librosa import mu_compress
mulaw_audio = mu_compress(audio_resampled).astype(np.uint8)
Finally, I convert to a base64 encoded string:
import base64
wav_base64 = base64.b64encode(mulaw_audio).decode()
Then, I put this in the JSON format as described in the Twilio docs and write it to the websocket. However, the audio that plays over the phone is angry loud noise with no content. Clearly, there must be something wrong with the data that is being passed into the socket.
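For reference, here is a minimal sketch of how the outbound message can be assembled. It assumes the "media" event shape described in Twilio's Media Streams documentation (an `event` field, the `streamSid` that Twilio sends in its "start" message, and a Base64 `payload`). The helper name `build_media_message` and the example stream SID are hypothetical, not part of my actual code.

```python
import base64
import json


def build_media_message(mulaw_bytes: bytes, stream_sid: str) -> str:
    """Wrap raw 8 kHz mu-law bytes in a Twilio-style "media" event (assumed shape)."""
    # The payload must be the Base64 text of the raw mu-law bytes, with no WAV header.
    payload = base64.b64encode(mulaw_bytes).decode("ascii")
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": payload},
    })


# Hypothetical usage: 160 bytes of 0xFF is 20 ms of mu-law silence at 8 kHz.
message = build_media_message(b"\xff" * 160, "MZ_example_stream_sid")
```

The resulting string is what gets written to the websocket as a text frame.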
I've verified that the base64 string doesn't change whether I do unsigned or signed ints and that the audio generated sounds good on disk. So, there must be something wrong with either my mu-law or base64 encodings.
I've also verified that there are no headers present in audio_array.
I'd really appreciate some help debugging this, as I haven't found a Python example online of someone writing audio to a Twilio websocket but I'm sure plenty of people have done it.
EDIT: SOLVED. The solution was to use the audioop module from the standard library (deprecated since Python 3.11 and removed in Python 3.13). The code below generates audio that Twilio understands.
import audioop
import base64
import numpy as np
from scipy.signal import resample

# Generate audio from text
audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_6")
# Resample from 24000 to 8000
audio_resampled = resample(audio_array, len(audio_array) * PHONE_SAMPLE_RATE // SAMPLE_RATE)
# Quantize to 16-bit PCM; clip first so samples at or beyond +/-1.0 do not overflow int16
pcm_array = np.int16(np.clip(audio_resampled, -1.0, 1.0) * 32767).tobytes()
# Take the 16-bit linear PCM (2 bytes per sample) down to 8-bit mu-law
mulaw_array = audioop.lin2ulaw(pcm_array, 2)
# Convert to base64
wav_base64 = base64.b64encode(mulaw_array).decode()
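Since audioop is no longer available on Python 3.13 and later, here is a rough pure-Python sketch of the same 16-bit-linear to mu-law step. It follows the widely circulated G.711 reference algorithm (bias of 0x84, clip at 32635, 3-bit segment plus 4-bit mantissa, all bits inverted). I have not checked that it is byte-for-byte identical to `audioop.lin2ulaw`, and the function names `float_to_ulaw_byte` and `floats_to_ulaw` are my own. It is slow for long clips, so treat it as an illustration rather than a drop-in replacement.

```python
import base64

BIAS = 0x84    # 132, added to the magnitude before the segment is chosen
CLIP = 32635   # largest magnitude that still fits in 15 bits after adding BIAS


def float_to_ulaw_byte(sample: float) -> int:
    """Encode one float sample in [-1.0, 1.0] as an 8-bit G.711-style mu-law code."""
    clipped = max(-1.0, min(1.0, sample))
    pcm = int(clipped * 32767)                 # quantize to 16-bit linear PCM
    sign = 0x80 if pcm < 0 else 0x00
    magnitude = min(abs(pcm), CLIP) + BIAS     # always in 132..32767
    exponent = magnitude.bit_length() - 8      # segment number, 0..7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law transmits the bitwise complement of sign | exponent | mantissa
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def floats_to_ulaw(samples) -> bytes:
    """Encode an iterable of float samples (list or NumPy array) to mu-law bytes."""
    return bytes(float_to_ulaw_byte(s) for s in samples)


# Silence, full-scale positive, full-scale negative
mulaw_bytes = floats_to_ulaw([0.0, 1.0, -1.0])
wav_base64 = base64.b64encode(mulaw_bytes).decode("ascii")
```

Note that silence encodes to 0xFF rather than 0x00, which is one way to sanity-check a payload: a stream that should be quiet but is full of bytes near 0x00 or 0x80 will play back as full-scale noise.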