
I'm currently working on a tool that reads all my notifications to me by connecting to different APIs.

It's working great, but now I would like to add some voice commands to trigger actions.

For example, when the software says "One mail from Bob", I would like to say "Read it" or "Archive it".

My software runs on a Node server. I currently don't have any browser implementation, but that could be a plan for later.

What is the best way to enable speech-to-text in Node.js?

I've seen a lot of threads on this, but most of them use the browser, and I would like to avoid that at first if possible. Is it?

Another issue is that some libraries require a WAV file as input. I don't have any file; I just want my software to always be listening so it can react when I say a command.

Do you have any information on how I could do that?

Cheers

Vico
  • I've seen some implementations that connect to google services for this, I assume that's what you're talking about. I doubt there will be a native speech parser without that much power for a while. – Phix Feb 26 '16 at 05:51

3 Answers


Both of the answers already here are good, but what I think you're looking for is Sonus. It takes care of audio encoding and streaming for you. It is always listening offline for a customizable hotword (like "Siri" or "Alexa"), and you can also trigger listening programmatically. In combination with a module like `say`, you could enable your example by doing something like:

say.speak('One mail from Bob', function(err) {
  Sonus.trigger(sonus, 1) // start listening
});

You can also use different hotwords to handle the subsequent recognized speech in a different way. For instance:
"Notifications. Most recent." and "Send message. How are you today"

Throw that onto a Pi or a CHIP with a microphone on your desk and you have a personal assistant that reads your notifications and reacts to commands.

Simple Example:
https://twitter.com/_evnc/status/811290460174041090

Something a bit more complex:
https://youtu.be/pm0F_WNoe9k?t=20s

Full documentation:
https://github.com/evancohen/sonus/blob/master/docs/API.md

Disclaimer: This is my project :)

evancohen
  • No Windows support. – sean Aug 03 '18 at 03:02
  • This package does little more than a bit of streaming. It offloads everything else to other libraries and only supports Google recognition anyway. Google's Speech to Text API only works through the cloud (so not offline) and is not free. The `say` package is for speech synthesis. – Phil Dec 30 '21 at 17:12

To recognize a few commands without streaming audio to a server, you can use the node-pocketsphinx module, available on NPM.

The code to recognize a few commands in a continuous stream should look like this:

var fs = require('fs');
var ps = require('pocketsphinx').ps;

var modeldir = "../../pocketsphinx/model/en-us/";

var config = new ps.Decoder.defaultConfig();
config.setString("-hmm", modeldir + "en-us");
config.setString("-dict", modeldir + "cmudict-en-us.dict");
config.setString("-kws", "keyword list"); // path to your keyword list file
var decoder = new ps.Decoder(config);

fs.readFile("../../pocketsphinx/test/data/goforward.raw", function(err, data) {
    if (err) throw err;
    decoder.startUtt();
    decoder.processRaw(data, false, false);
    decoder.endUtt();
    console.log(decoder.hyp());
});

Instead of readFile, you just read the data from the microphone and pass it to the recognizer. The list of keywords to detect, one phrase per line with its detection threshold, should look like this:

read it /1e-20/
archive it /1e-20/
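Once the decoder spots a phrase, you still have to map it to an action in your tool. A minimal sketch of that dispatch step (the command names and the `mail` object are illustrative, assuming you extract the spotted phrase as a string from the decoder's hypothesis):

```javascript
// Map spotted keyphrases to actions (names are illustrative).
var commands = {
  'read it': function(mail) { return 'reading: ' + mail.subject; },
  'archive it': function(mail) { return 'archived: ' + mail.subject; }
};

// phrase is the text the keyword spotter returned for the utterance.
function dispatch(phrase, mail) {
  var key = phrase.trim().toLowerCase();
  var action = commands[key];
  return action ? action(mail) : null; // ignore anything not in the list
}

console.log(dispatch('Read it ', { subject: 'Hello from Bob' }));
// → reading: Hello from Bob
```

Keeping the spotter's vocabulary and this table in sync means unrecognized speech simply falls through to `null` instead of triggering anything.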

For more details on keyword spotting with PocketSphinx, see Keyword Spotting in Speech and Recognizing multiple keywords using PocketSphinx.

Nikolay Shmyrev

To get audio data into your application, you could try a module like microphone, which I haven't used but looks promising. This could be a way to avoid using the browser for audio input.

To do the actual speech recognition, you could use the Speech to Text service of IBM Watson Developer Cloud. This service supports a websocket interface, so you can have a full-duplex connection, piping audio data to the cloud and getting back the resulting transcription. You may want to implement a form of onset detection to avoid transmitting long stretches of (relative) silence to the service; that way, you can stay within the free tier.
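A simple form of onset detection is an energy threshold: compute the RMS level of each incoming PCM chunk and only forward audio once the level rises above a threshold. A rough sketch, assuming 16-bit signed little-endian PCM (the usual raw microphone format) and a hand-tuned threshold:

```javascript
// Returns true if a chunk of 16-bit LE PCM audio is "loud enough" to forward.
// The threshold is a tuning parameter; calibrate it against your microphone.
function isSpeech(chunk, threshold) {
  var sumSquares = 0;
  var samples = chunk.length / 2; // 2 bytes per 16-bit sample
  for (var i = 0; i < chunk.length; i += 2) {
    var sample = chunk.readInt16LE(i);
    sumSquares += sample * sample;
  }
  var rms = Math.sqrt(sumSquares / samples);
  return rms > threshold;
}

// Example: a silent buffer vs. a constant loud one.
var silence = Buffer.alloc(320); // all zeros -> RMS 0
var loud = Buffer.alloc(320);
for (var i = 0; i < loud.length; i += 2) loud.writeInt16LE(10000, i);
console.log(isSpeech(silence, 500)); // false
console.log(isSpeech(loud, 500));    // true
```

In practice you would also keep a short pre-roll buffer so the first syllable isn't clipped, and stop forwarding after the level stays below the threshold for a second or so.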

There is also a text-to-speech service, but it sounds like you have a solution already for that part of your tool.

Disclosure: I am an evangelist for IBM Watson.

Abtin Forouzandeh