
Disclaimer: I am a newbie to C# and Xamarin.Forms - sorry if I'm missing anything obvious.

I am trying to create an app that takes user input in the form of a voice command (using Speech-To-Text) and outputs an audio announcement from the application (using Text-To-Speech).

The issue is that when you start recording audio for the Speech-To-Text service, the device's audio session is switched into recording mode (I'm not sure of the exact technical term for this) and playback audio is ducked to a very low volume (as described in this SO question and here).

I'm ideally looking for a way to revert this so that once the appropriate voice command is recognised (i.e. 'Secret command') via Speech-To-Text, the user can hear the secret phrase back at full/normal volume through Text-To-Speech in a Xamarin Forms application.

I tried to produce a working example by adapting the sample code for the Azure Cognitive Services Speech service. I cloned the code and adapted the XAML and code-behind for the MainPage slightly, as shown below, to stop the speech recognition service once a certain voice command is recognised and then speak a phrase via the Text-To-Speech service. My sample demonstrates the issue: if the user taps the Transcribe button and says the appropriate voice command, they should hear the secret phrase back, but when testing on a physical iOS device the playback volume is so low I can barely hear it.

XAML

<ContentPage xmlns="http://xamarin.com/schemas/2014/forms"
             xmlns:x="http://schemas.microsoft.com/winfx/2009/xaml"
             xmlns:d="http://xamarin.com/schemas/2014/forms/design"
             xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
             mc:Ignorable="d"
             x:Class="CognitiveSpeechService.MyPage"
             Title="Speech Services Transcription"
             Padding="10,35,10,10">

    <StackLayout>
        <Frame BorderColor="DarkGray"
               CornerRadius="10"
               HeightRequest="300"
               WidthRequest="280"
               HorizontalOptions="Center"
               VerticalOptions="Start"
               BackgroundColor="LightGray">
            <ScrollView x:Name="scroll">
                <Label x:Name="transcribedText"
                       Margin="10,10,10,10" />
            </ScrollView>
        </Frame>

        <ActivityIndicator x:Name="transcribingIndicator"
                           HorizontalOptions="Center"
                           VerticalOptions="Start"
                           WidthRequest="300"
                           IsRunning="False" />
        <Button x:Name="transcribeButton"
                WidthRequest="300"
                HeightRequest="50"
                Text="Transcribe"
                TextColor="White"
                CornerRadius="10"
                BackgroundColor="Green"
                BorderColor="DarkGray"
                BorderWidth="1"
                FontAttributes="Bold"
                HorizontalOptions="Center"
                VerticalOptions="Start"
                Clicked="TranscribeClicked"/>

        <Button x:Name="SpeakBtn"
                WidthRequest="300"
                HeightRequest="50"
                Text="Speak"
                TextColor="White"
                CornerRadius="10"
                BackgroundColor="Red"
                BorderColor="DarkGray"
                BorderWidth="1"
                FontAttributes="Bold"
                HorizontalOptions="Center"
                VerticalOptions="Start"
                Clicked="SpeakBtn_Clicked"/>

    </StackLayout>

</ContentPage>

Code-behind

namespace CognitiveSpeechService
{
    public partial class MyPage : ContentPage
    {

        AudioRecorderService recorder = new AudioRecorderService();

        SpeechRecognizer recognizer;
        IMicrophoneService micService;
        bool isTranscribing = false;

        public MyPage()
        {
            InitializeComponent();

            micService = DependencyService.Resolve<IMicrophoneService>();
        }

        async void TranscribeClicked(object sender, EventArgs e)
        {
            bool isMicEnabled = await micService.GetPermissionAsync();

            // EARLY OUT: make sure mic is accessible
            if (!isMicEnabled)
            {
                UpdateTranscription("Please grant access to the microphone!");
                return;
            }

            // initialize speech recognizer 
            if (recognizer == null)
            {
                var config = SpeechConfig.FromSubscription(Constants.CognitiveServicesApiKey, Constants.CognitiveServicesRegion);
                recognizer = new SpeechRecognizer(config);
                recognizer.Recognized += (obj, args) =>
                {
                    UpdateTranscription(args.Result.Text);
                };
            }

            // if already transcribing, stop speech recognizer
            if (isTranscribing)
            {
                StopSpeechRecognition();
            }

            // if not transcribing, start speech recognizer
            else
            {
                Device.BeginInvokeOnMainThread(() =>
                {
                    InsertDateTimeRecord();
                });
                try
                {
                    await recognizer.StartContinuousRecognitionAsync();
                }
                catch (Exception ex)
                {
                    UpdateTranscription(ex.Message);
                }
                isTranscribing = true;
            }
            UpdateDisplayState();
        }

        // https://stackoverflow.com/questions/56514413/volume-has-dropped-significantly-in-text-to-speech-since-adding-speech-to-text
        private async void StopSpeechRecognition()
        {
            if (recognizer != null)
            {
                try
                {
                    await recognizer.StopContinuousRecognitionAsync();
                    Console.WriteLine($"IsRecording: {recorder.IsRecording}");
                }
                catch (Exception ex)
                {
                    UpdateTranscription(ex.Message);
                }
                isTranscribing = false;
                UpdateDisplayState();
            }
        }

        void UpdateTranscription(string newText)
        {
            Device.BeginInvokeOnMainThread(() =>
            {
                if (!string.IsNullOrWhiteSpace(newText))
                {

                    if (newText.ToLower().Contains("secret command"))
                    {
                        Console.WriteLine("secret voice command detected");

                        // stop speech recognition
                        StopSpeechRecognition();

                        // do callout
                        string success = "this works!";

                        var settings = new SpeechOptions()
                        {
                            Volume = 1.0f,
                        };

                        _ = TextToSpeech.SpeakAsync(success, settings); // fire-and-forget; the enclosing lambda is not async

                        // restart speech recognition here if required

                    } else
                    {
                        transcribedText.Text += $"{newText}\n";
                    }
                }
            });
        }

        void InsertDateTimeRecord()
        {
            var msg = $"=================\n{DateTime.Now.ToString()}\n=================";
            UpdateTranscription(msg);
        }

        void UpdateDisplayState()
        {
            Device.BeginInvokeOnMainThread(() =>
            {
                if (isTranscribing)
                {
                    transcribeButton.Text = "Stop";
                    transcribeButton.BackgroundColor = Color.Red;
                    transcribingIndicator.IsRunning = true;
                }
                else
                {
                    transcribeButton.Text = "Transcribe";
                    transcribeButton.BackgroundColor = Color.Green;
                    transcribingIndicator.IsRunning = false;
                }
            });
        }

        async void SpeakBtn_Clicked(object sender, EventArgs e)
        {
            await TextToSpeech.SpeakAsync("Sample audio line. Blah blah blah. ");
        }
    }
}
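A note on the trigger check in `UpdateTranscription` above: case-insensitive matching is easy to get wrong when you lower-case one side and compare against a literal. A sketch of a more explicit pattern, in plain C# outside Xamarin (`CommandMatcher` is a hypothetical helper name):

```csharp
using System;

static class CommandMatcher
{
    // True when the recognised text contains the trigger phrase, ignoring case:
    // "Secret command", "secret command" and "SECRET COMMAND" all match.
    public static bool ContainsCommand(string recognisedText, string trigger)
    {
        if (string.IsNullOrWhiteSpace(recognisedText))
            return false;
        return recognisedText.IndexOf(trigger, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Using `StringComparison.OrdinalIgnoreCase` also sidesteps culture-specific casing surprises (for example the Turkish dotless i) that `ToLower()` can introduce.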

Thanks for your help!

  • What's the code of `AudioRecorderService`? If it is convenient for you, could you please post a basic demo to GitHub or OneDrive so that we can test on our side? – Jessie Zhang -MSFT Aug 26 '21 at 08:30
  • @JessieZhang-MSFT `AudioRecorderService` is from a plugin I was using to diagnose the problem. Please ignore. I created a test repo to demonstrate the issue better [here](https://github.com/TketEZ/xamarin-forms-samples). Hopefully, this is much clearer. – ProgrammerInPractice Aug 26 '21 at 09:58

2 Answers


Found a working solution. Posting it below for anyone else it might help, and for future me.

I noticed this issue was only happening on iOS and not Android; it comes down to the category that AVAudioSession is set to when STT is enabled. As I best understand it, once STT is enabled, audio ducking turns on for any non-STT-related audio.

You can resolve this issue by programmatically setting the appropriate category using the AVAudioSession Xamarin.iOS API.

To get this working properly in a Xamarin.Forms project, you will need to use the DependencyService to call the Xamarin.iOS code from your shared project code.

I have set out the relevant bits of the code that worked for me below.

A full working example can be found in the solution branch of the GitHub repo mentioned in the comments above.

MainPage (where the STT and TTS calls happen)

    public partial class MainPage : ContentPage
    {
        IMicrophoneService micService;
        IAudioSessionService audioService;

        public MainPage()
        {
            InitializeComponent();

            micService = DependencyService.Resolve<IMicrophoneService>();

            if (Device.RuntimePlatform == Device.iOS)
            {
                audioService = DependencyService.Resolve<IAudioSessionService>();
            }
        }

        public void SpeechToText()
        {
            // wherever STT is required, call this first to set the right audio category
            audioService?.ActivateAudioRecordingSession();
        }

        // named ...Async so it doesn't clash with the Xamarin.Essentials TextToSpeech class,
        // and marked async so that SpeakAsync can be awaited
        public async Task TextToSpeechAsync()
        {
            // wherever TTS is required, let the OS know that you're playing audio so TTS interrupts instead of ducking
            audioService?.ActivateAudioPlaybackSession();

            await TextToSpeech.SpeakAsync(TextForTextToSpeechAfterSpeechToText, settings);

            // set the audio session back to recording mode, ready for STT
            audioService?.ActivateAudioRecordingSession();
        }
    }
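The ordering here is the important part: switch the session to playback before speaking, and only switch back to recording once speech has finished. That sequencing can be sketched in plain C# with the platform calls stubbed out (this is a self-contained sketch; `Announcer` and the fake service are hypothetical names, and the delegate stands in for `TextToSpeech.SpeakAsync`):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public interface IAudioSessionService
{
    void ActivateAudioPlaybackSession();
    void ActivateAudioRecordingSession();
}

// Test double that records the order in which the session is switched.
public class FakeAudioSessionService : IAudioSessionService
{
    public List<string> Calls { get; } = new List<string>();
    public void ActivateAudioPlaybackSession() => Calls.Add("playback");
    public void ActivateAudioRecordingSession() => Calls.Add("recording");
}

public class Announcer
{
    readonly IAudioSessionService audioService;
    readonly Func<string, Task> speakAsync; // stands in for TextToSpeech.SpeakAsync

    public Announcer(IAudioSessionService audioService, Func<string, Task> speakAsync)
    {
        this.audioService = audioService;
        this.speakAsync = speakAsync;
    }

    public async Task AnnounceAsync(string phrase)
    {
        audioService?.ActivateAudioPlaybackSession();  // interrupt instead of duck
        await speakAsync(phrase);                      // play the announcement
        audioService?.ActivateAudioRecordingSession(); // ready for STT again
    }
}
```

Keeping the session switches behind an interface like this also means the Android implementation can simply no-op, exactly as in the dependency-service code below.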

IAudioSessionService

// this interface should be in your shared project 
namespace CognitiveSpeechService.Services
{
    public interface IAudioSessionService
    {
        void ActivateAudioPlaybackSession();
        void ActivateAudioRecordingSession();
    }
}

Project.Android/AndroidAudioSessionService

using System;
using CognitiveSpeechService.Services;
using Xamarin.Forms;

[assembly: Dependency(typeof(CognitiveSpeechService.Droid.Services.AndroidAudioSessionService))]
namespace CognitiveSpeechService.Droid.Services
{
    public class AndroidAudioSessionService : IAudioSessionService
    {
        public void ActivateAudioPlaybackSession()
        {
            // do nothing as not required on Android
        }

        public void ActivateAudioRecordingSession()
        {
            // do nothing as not required on Android
        }
    }
}

Project.iOS/IOSAudioSessionService

using System;
using AVFoundation;
using CognitiveSpeechService.Services;
using Foundation;
using Xamarin.Forms;

[assembly: Dependency(typeof(CognitiveSpeechService.iOS.Services.IOSAudioSessionService))]
namespace CognitiveSpeechService.iOS.Services
{
    public class IOSAudioSessionService : IAudioSessionService
    {
        public void ActivateAudioPlaybackSession()
        {
            var session = AVAudioSession.SharedInstance();
            session.SetCategory(AVAudioSessionCategory.Playback, AVAudioSessionCategoryOptions.DuckOthers);
            session.SetMode(AVAudioSession.ModeSpokenAudio, out NSError error);
            session.SetActive(true);
        }

        public void ActivateAudioRecordingSession()
        {
            try
            {
                new System.Threading.Thread(new System.Threading.ThreadStart(() =>
                {
                    var session = AVAudioSession.SharedInstance();
                    session.SetCategory(AVAudioSessionCategory.Record);
                    session.SetActive(true);
                })).Start();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }
}

  • This solution fixes the low volume problem in iOS with SimpleAudioPlayer. However, it breaks Microsoft.CognitiveServices.Speech ListenOnceAsync(), in that only the first couple of words are picked up. The recording session is cut short and the remaining spoken words are lost. – Bruce Haley Mar 19 '23 at 18:55

ProgrammerInPractice's solution did not work for me. (See my comment on it.) I found a solution that works here: Toggle audio in speaker to ear-speaker and vice versa in iphone but microphone is muted

It allowed both microphone and speakerphone to work simultaneously on the iPhone.