
I am trying to use .NET's System.Speech SpeechRecognitionEngine to recognize words spoken by a Discord user in a voice channel. The raw PCM audio received by the bot is written to a MemoryStream, and I am trying to get the SpeechRecognitionEngine to use this stream for recognition.

Receiving and writing the data works fine, but using it with the SpeechRecognitionEngine fails for several reasons. For one, the stream is not infinite: the recognizer reaches the end of the stream and stops before any words can even be spoken. Even when data is constantly being added to the stream (i.e. the user is talking continuously), the recognizer still reaches the end of the stream and refuses to continue. Another issue is that the method that runs recognition seemingly cannot be run more than once. I've tried feeding the stream to the recognizer in chunks, but that didn't seem to work either.

There is an option to set the input to the default audio device, and that behaves exactly how I want mine to: it keeps running and doesn't stop even when the user provides no input. Any help?

private SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine();
public MemoryStream stream = new MemoryStream();

//called before any other method when the bot joins the voice channel
public void StartRun(){
    Choices commands = new Choices();
    commands.Add(new string[] { "hello", "hey bot"});
    GrammarBuilder gBuilder = new GrammarBuilder();
    gBuilder.Append(commands);
    Grammar grammar = new Grammar(gBuilder);
    
    recognizer.LoadGrammar(grammar);
    // holdStream is the same object as 'stream' declared above (see the comments below)
    recognizer.SetInputToAudioStream(holdStream, new SpeechAudioFormatInfo(48000, AudioBitsPerSample.Sixteen, AudioChannel.Mono));

    recognizer.SpeechRecognized += async (s, e) => {}; //handles
    
    //the event handler I have for this event prints something whenever it reaches the end of the stream
    recognizer.RecognizeCompleted += RecognizeCompleted;
    recognizer.RecognizeAsync(RecognizeMode.Multiple);
}

In another program I write the PCM data to 'stream'. If there are any syntax errors, it is because I retyped parts of the code by hand to simplify it instead of copying and pasting. Thank you!
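
For reference, the end-of-stream behaviour described above can be reproduced without the speech engine at all. The following is a minimal sketch (the class name `MemoryStreamEofDemo` is hypothetical and not part of the bot): it shows that a read on a MemoryStream never waits for data, and that a write moves the one shared position past the data just written.

```csharp
using System;
using System.IO;

// Hypothetical demo class, not part of the bot code above.
public static class MemoryStreamEofDemo
{
    // Returns the byte counts of three successive reads.
    public static int[] Run()
    {
        var stream = new MemoryStream();
        var buffer = new byte[4];

        // Reading an empty MemoryStream returns 0 immediately; it does not block.
        int first = stream.Read(buffer, 0, buffer.Length);

        // Writing advances the single shared Position to the end of the new data...
        stream.Write(new byte[] { 1, 2, 3, 4 }, 0, 4);

        // ...so a read issued right after the write still reports end of stream.
        int second = stream.Read(buffer, 0, buffer.Length);

        // Only after rewinding does the reader see the data.
        stream.Seek(0, SeekOrigin.Begin);
        int third = stream.Read(buffer, 0, buffer.Length);

        return new[] { first, second, third };
    }
}
```

Calling `MemoryStreamEofDemo.Run()` yields the counts 0, 0 and 4, which matches the recognizer stopping as soon as it catches up with the writer.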

Cojack
  • Are stream and holdStream the same thing? – Ralf Jun 23 '21 at 20:29
  • If you're trying to start the speech engine with a MemoryStream, and THEN populate the stream, I don't think that's ever going to work. If you want to stream voice data to the engine after it is started, you'll need a blocking stream. MemoryStream signals End Of Stream after you read the last byte. You need a pipe. – glenebob Jun 23 '21 at 20:47
  • @Ralf Yes, they are, sorry. Some parts were copy-pasted and some weren't. – Cojack Jun 23 '21 at 20:49
  • @glenebob That sounds right, I kinda just chose a MemoryStream at random because it seemed to fit and worked with my beta tests. In that case, should I use .NET's PipeStream class? Or is there a better way of going about this? Thank you! – Cojack Jun 23 '21 at 20:52
  • @glenebob Looking into it, a BufferedStream seems to fit more with what you were saying – Cojack Jun 23 '21 at 21:02
  • @Cojack BufferedStream is definitely not what you want. It just adds buffering on top of some other stream. The fundamental problems you're facing with MemoryStreams are that 1) they don't block waiting for more data, and 2) they only have one position property. Adding a BufferedStream won't help with either of those problems. – glenebob Jun 24 '21 at 14:02
  • @Cojack You could use AnonymousPipe*Streams, but I would not recommend it, as they involve an underlying OS pipe, which allows IPC. You don't appear to need IPC; everything occurs within one process. OS pipes represent much more overhead than you need. – glenebob Jun 24 '21 at 14:05
  • @Cojack I believe the answer here is to implement a pipe based on an array of buffers, and a stream which can either write or read to an instance of that pipe. You would create the pipe and a pair of streams, hand the read stream to the speech engine, and then write the data to the write stream. I'm commenting (not answering) because such a class library is not exactly trivial to implement - it involves fairly intricate thread synchronization, and I'm not aware of such a library available for download. Knowing what you need may help you find it, though. – glenebob Jun 24 '21 at 14:14
  • @glenebob Aren't pipes exclusively for IPC? Why would I need a pipe when I won't be doing any inter-process communication? – Cojack Jun 25 '21 at 01:49
  • @Cojack you need a pipe (just another term for FIFO) because the problem you're trying to solve appears to require the data handling behavior that a pipe exhibits. Pipes are often used for IPC, but that's obviously not their only purpose. You're confusing the generic term "pipe" with terms for specific implementations such as "anonymous pipe" and "named pipe". You need a pipe, but you don't need an OS supplied pipe. – glenebob Jun 25 '21 at 17:17
  • @glenebob Ahhh ok, that makes more sense. I’ll definitely look into this! Thank you – Cojack Jun 25 '21 at 18:54
  • @glenebob So just to clarify, I'll need to use a pipe (which is FIFO communication, like a queue), and two streams, one that can write to this pipe and one that can read the pipe and is connected to the SpeechRecognitionEngine. Should I create my own Stream class by inheriting from .NET's abstract Stream class? I guess I'm just confused about how this allows for blocking. Thank you for all your help – Cojack Jun 28 '21 at 18:54
  • @Cojack that is how I would implement it if I was going for something generic. But you can actually accomplish what you need by implementing the pipe and both streams in one custom stream class. In fact, you only need a stream to satisfy the recognition engine's interface; you're free to push data into the pipe however you like. Accomplishing blocking on the read side is going to be your biggest challenge. – glenebob Jun 28 '21 at 20:52
  • @glenebob and what is the general idea of how to go about blocking? Do I have to block the thread the recognitionEngine is using? And how does using a queue differ from just writing it to a stream? Is it just there to solve the issue of only having 1 position property and blocking the recEngine from reading more data is entirely different? Sorry for all the questions I’m just still confused – Cojack Jun 28 '21 at 23:27
  • @Cojack I threw together a set of classes you can try. Usage is demonstrated in a unit test. It is not well tested and needs parameter validations added, which I'll do later. But I think it does what you need. I've implemented this type of thing before (I don't have, nor do I own that code) because I couldn't find anything online. Maybe this will prove helpful to others as well. https://github.com/glenebob/Pipe/tree/master/Pipe – glenebob Jun 30 '21 at 00:19
  • @glenebob Thank you so much. this is amazing! I'm going to try and read the code and figure it all out before I implement it, but I'll be sure to tell you if it does! Thank you again – Cojack Jun 30 '21 at 00:57
  • @Cojack you are most welcome. Feel free to comment or submit issues and pull requests over on GitHub. – glenebob Jul 01 '21 at 23:57
  • @Cojack I decided to actually try this out against the speech engine, and it doesn't work. The engine expects a seekable stream, which a pipe is not. You can modify PipeReadStream to make it work (don't throw exceptions), but you can also do it by loading your MemoryStream with all the data BEFORE handing it to the speech engine. Note that you'll also need to reset the stream (Seek(0, SeekOrigin.Begin)) after loading it. The missing Seek() is also why you're having trouble using the same stream twice. – glenebob Jul 02 '21 at 22:53
  • @glenebob This is honestly good news, I've been trying for a couple hours and kept running into problems, knowing it's not me making a stupid mistake is relieving. The whole point of the project is to have live recognition, so populating the MemoryStream after (mostly, if I can't find a solution I'll resort to that) defeats the purpose. I found a REALLY old Stack Overflow question that seems to address the exact issue, but like I said it's very old. Do you know if this would still work? Even if it didn't end up working out, thank you so much for all the help – Cojack Jul 02 '21 at 22:59
  • @glenebob https://stackoverflow.com/questions/1682902/streaming-input-to-system-speech-recognition-speechrecognitionengine – Cojack Jul 02 '21 at 23:00
  • @Cojack Look at the Stream modifications posted in that article... You should be able to modify my stream implementation (and maybe the pipe itself) in a similar fashion. What I wrote is a generic pipe; what you end up with will be a specialized pipe useful only for this project, so keep that in mind when coming up with a name, but I see no reason it can't work. I'm not sure I'll have time to be of much help this weekend, but if you have a snippet of voice data that the engine recognizes, maybe you could send it to me and I'll see what I can come up with - glenebob@gmail.com – glenebob Jul 03 '21 at 18:04
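
The single-class approach described in the comments above (one custom Stream that holds the queue, accepts writes, and blocks on read) might be sketched roughly as follows. This is a sketch under assumptions, not a tested solution: the name `BlockingAudioStream` and its `CompleteWriting` method are hypothetical, and reporting the stream as seekable while ignoring `Seek` and `Position` changes is only a guess at what "don't throw exceptions" would look like in practice. Whether SpeechRecognitionEngine accepts such a stream has not been verified here.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

// Hypothetical single-class pipe: Write enqueues a chunk, Read blocks until data arrives.
public sealed class BlockingAudioStream : Stream
{
    private readonly BlockingCollection<byte[]> _chunks = new BlockingCollection<byte[]>();
    private byte[] _current = new byte[0]; // chunk currently being consumed by the reader
    private int _offset;                   // read offset inside _current
    private long _position;                // total bytes handed to the reader so far

    public override bool CanRead => true;
    public override bool CanWrite => true;

    // Assumption: pretend to be seekable and never throw, as suggested in the comments.
    public override bool CanSeek => true;
    public override long Length => long.MaxValue; // placeholder; what the engine expects is an assumption
    public override long Position { get => _position; set { /* ignored */ } }
    public override long Seek(long offset, SeekOrigin origin) => _position; // ignored
    public override void SetLength(long value) { }
    public override void Flush() { }

    // Writer side: copy the caller's bytes and queue them as one chunk.
    public override void Write(byte[] buffer, int offset, int count)
    {
        if (count <= 0) return;
        var copy = new byte[count];
        Buffer.BlockCopy(buffer, offset, copy, 0, count);
        _chunks.Add(copy);
    }

    // Signals that no more audio will arrive; once the queue drains, Read returns 0.
    public void CompleteWriting() => _chunks.CompleteAdding();

    // Reader side: waits for data instead of reporting end of stream.
    public override int Read(byte[] buffer, int offset, int count)
    {
        if (count == 0) return 0;
        while (_offset == _current.Length)
        {
            // Blocks until a chunk is available; returns false only after
            // CompleteWriting has been called and the queue is empty.
            if (!_chunks.TryTake(out var next, Timeout.Infinite)) return 0;
            _current = next;
            _offset = 0;
        }
        int n = Math.Min(count, _current.Length - _offset);
        Buffer.BlockCopy(_current, _offset, buffer, offset, n);
        _offset += n;
        _position += n;
        return n;
    }
}
```

An instance would be passed to `recognizer.SetInputToAudioStream(...)` in place of the MemoryStream, with the code that receives PCM data calling `Write` on the same instance from its own thread. The sketch assumes a single reader thread and does not cover disposal while a read is blocked.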

0 Answers