Audio written to Twilio websocket in audio/x-mulaw 8kHz is garbage

Question

I'm using a text-to-speech system to generate audio and write it to a websocket via the Twilio Connect functionality. According to Twilio's documentation, the audio must be Base64-encoded $\mu$-law at 8 kHz. The TTS system I use generates a float array at 24 kHz. I downsampled it to 8 kHz, wrote the result to a .wav file, and verified that it sounds good.

from scipy.io.wavfile import write as write_wav  # assumed source of write_wav
from scipy.signal import resample

SAMPLE_RATE = 24000       # TTS output rate
PHONE_SAMPLE_RATE = 8000  # rate Twilio expects

# Generate audio from text
audio_array = generate_audio(text_prompt)

# Resample from 24000 to 8000
audio_resampled = resample(audio_array, len(audio_array) * PHONE_SAMPLE_RATE // SAMPLE_RATE)

with open('out.wav', 'wb') as f:
    write_wav(f, PHONE_SAMPLE_RATE, audio_resampled)
# verified that 'out.wav' plays on my computer and sounds good

Next, I do mu-law encoding using the mu_compress function available in librosa:

import numpy as np
from librosa import mu_compress
mulaw_audio = mu_compress(audio_resampled).astype(np.uint8)

Finally, I convert to a base64 encoded string:

import base64
wav_base64 = base64.b64encode(mulaw_audio).decode()

Then, I put this in the JSON format as described in the Twilio docs and write it to the websocket. However, the audio that plays over the phone is angry loud noise with no content. Clearly, there must be something wrong with the data that is being passed into the socket.
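For reference, a minimal sketch of the message written to the socket. It assumes the media message shape from Twilio's bidirectional Media Streams documentation (an `event` of `media`, the `streamSid`, and a Base64 `media.payload`). The helper name `build_media_message` and the stream SID value are hypothetical, used only for illustration.

```python
import base64
import json


def build_media_message(stream_sid: str, mulaw_bytes: bytes) -> str:
    """Wrap raw 8 kHz mu-law bytes in a Twilio-style media message (assumed shape)."""
    payload = base64.b64encode(mulaw_bytes).decode("ascii")
    return json.dumps(
        {
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": payload},
        }
    )


# Hypothetical stream SID; the real one arrives in Twilio's "start" message.
message = build_media_message("MZ_example_stream_sid", b"\xff" * 160)
```

The payload must be headerless audio bytes, so nothing from a .wav container should be included.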

I've verified that the base64 string doesn't change whether I do unsigned or signed ints and that the audio generated sounds good on disk. So, there must be something wrong with either my mu-law or base64 encodings.

I've also verified that there are no headers present in audio_array.

I'd really appreciate some help debugging this, as I haven't found a Python example online of someone writing audio to a Twilio websocket but I'm sure plenty of people have done it.

EDIT: SOLVED. The solution was to use the audioop module from the standard library (deprecated since Python 3.11 and removed in 3.13). The code below generates audio that Twilio understands.

    import audioop  # standard library
    import base64
    import numpy as np
    from scipy.signal import resample

    # Generate audio from text
    audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_6")

    # Resample from 24000 to 8000
    audio_resampled = resample(audio_array, len(audio_array) * PHONE_SAMPLE_RATE // SAMPLE_RATE)

    # Quantize to 16-bit PCM; clip first so a sample at exactly 1.0 does not wrap around
    pcm_array = np.int16(np.clip(audio_resampled, -1.0, 1.0) * 32767).tobytes()

    # Take the 16-bit linear PCM down to 8-bit mu-law (2 = bytes per sample)
    mulaw_array = audioop.lin2ulaw(pcm_array, 2)

    # Convert to base64
    wav_base64 = base64.b64encode(mulaw_array).decode()
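Since audioop is not available on Python 3.13 and later, here is a hedged sketch of a pure-Python replacement for the float-to-mu-law step. It follows the widely used G.711 reference approach (bias of 0x84, clip at 32635, then sign, exponent, and mantissa bits inverted). The function names are hypothetical, and I have not confirmed that the output is byte-for-byte identical to `audioop.lin2ulaw` for every sample. As far as I can tell, this also explains the original noise: `mu_compress` returns companded levels rather than G.711 code bytes, so casting them to uint8 does not give the layout the phone network expects.

```python
import base64

BIAS = 0x84   # 132, added before locating the segment
CLIP = 32635  # largest magnitude that still fits after adding BIAS


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Segment number: position of the highest set bit above bit 7.
    exponent = (magnitude >> 7).bit_length() - 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted with all bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def floats_to_ulaw(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to mu-law bytes, clipping out-of-range values."""
    out = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))
        out.append(linear_to_ulaw(int(round(clipped * 32767))))
    return bytes(out)


# Usage: encode a few samples and Base64 them for the websocket payload.
payload = base64.b64encode(floats_to_ulaw([0.0, 0.5, -0.5])).decode()
```

The per-sample loop is slow for long clips but keeps the sketch dependency-free; a NumPy version would vectorize the same steps.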

I can't directly write it to disk as mulaw audio, I think, but if I convert to mulaw and then back to linear and play it, it is fine. Is there a straightforward way to write the mulaw audio and play it? — Viraj Mehta, Jul 12 '23 at 00:33
Why isn't it possible? I'd think the `wave` module with `writeframes` should work just fine. Remember to `setparams`. — Lukasz Tracewski, Jul 12 '23 at 07:01
You're right, Lukasz, that ended up being the prod I needed to get this working. Always need a tiny and easy-to-run reproducible example. Thank you! — Viraj Mehta, Jul 12 '23 at 16:50
Great to hear that, and nice that you posted an update on this. One tip: instead of editing the question, post the answer. — Lukasz Tracewski, Jul 12 '23 at 18:32
