
I am trying to add real-time speech recognition to my Java project (preferably offline). After some googling and trying other solutions, I settled on VOSK. The main problem is that VOSK has very little documentation and ships with only one Java example, which extracts text from a prerecorded WAV file, shown below.

public static void main(String[] argv) throws IOException, UnsupportedAudioFileException {
        LibVosk.setLogLevel(LogLevel.DEBUG);

        try (Model model = new Model("src\\main\\resources\\model");
                    InputStream ais = AudioSystem.getAudioInputStream(new BufferedInputStream(new FileInputStream("src\\main\\resources\\python_example_test.wav")));
                    Recognizer recognizer = new Recognizer(model, 16000)) {

            int nbytes;
            byte[] b = new byte[4096];
            while ((nbytes = ais.read(b)) >= 0) {
                System.out.println(nbytes);
                if (recognizer.acceptWaveForm(b, nbytes)) {
                    System.out.println(recognizer.getResult());
                } else {
                    System.out.println(recognizer.getPartialResult());
                }
            }

            System.out.println(recognizer.getFinalResult());
        }
    }

I attempted to convert this into something that would accept microphone audio, shown below:

public static void main(String[] args) {
        LibVosk.setLogLevel(LogLevel.DEBUG);
        AudioFormat format = new AudioFormat(8000.0f, 16, 1, true, true);
        TargetDataLine microphone;
        SourceDataLine speakers;

        try (Model model = new Model("src\\main\\resources\\model");
                Recognizer recognizer = new Recognizer(model, 16000)) {
            try {
                microphone = AudioSystem.getTargetDataLine(format);

                DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
                microphone = (TargetDataLine) AudioSystem.getLine(info);
                microphone.open(format);
                microphone.start();
                
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int numBytesRead;
                int CHUNK_SIZE = 1024;
                int bytesRead = 0;
                
                DataLine.Info dataLineInfo = new DataLine.Info(SourceDataLine.class, format);
                speakers = (SourceDataLine) AudioSystem.getLine(dataLineInfo);
                speakers.open(format);
                speakers.start();
                byte[] b = new byte[4096];

                while (bytesRead <= 100000) {
                    numBytesRead = microphone.read(b, 0, CHUNK_SIZE);
                    bytesRead += numBytesRead;
                    
                    out.write(b, 0, numBytesRead); 

                    speakers.write(b, 0, numBytesRead);

                    if (recognizer.acceptWaveForm(b, numBytesRead)) {
                        System.out.println(recognizer.getResult());
                    } else {
                        System.out.println(recognizer.getPartialResult());
                    }
                }
                System.out.println(recognizer.getFinalResult());
                speakers.drain();
                speakers.close();
                microphone.close();
            } catch (Exception e) {
                e.printStackTrace();
            }

        }

    }

This appears to capture microphone data correctly (it also plays it back through the speakers), but VOSK shows no input and constantly prints empty strings as results. What am I doing wrong? Is what I am attempting even possible? Should I try to find a different library for speech recognition?

Squalmals
  • The recording sample rate (8 kHz) does not match the recognizer's (16 kHz). Try making them equal, for example by changing the `AudioFormat` to 16 kHz. – gthanop Jul 15 '21 at 22:18
  • Making them equal doesn't seem to fix it, unfortunately. – Squalmals Jul 16 '21 at 06:25
  • It sure looks like what you have should be good. From what you have described, the speakers are playing back what the mic is picking up? (You have confirmed data is being sent to VOSK). Does VOSK work if you have a .wav file source instead (have you confirmed their basic example works)? It would be really poor if there was some sort of "on" switch for the VOSK given this isn't shown in their example. Did you see their "test" code? https://github.com/alphacep/vosk-api/blob/master/java/lib/src/test/java/org/vosk/test/DecoderTest.java – Phil Freihofner Jul 16 '21 at 19:16
  • @PhilFreihofner, yes, the speakers do play back the microphone audio correctly, and I have also confirmed that the basic example works utilizing a .wav file. VOSK has very little documentation, unfortunately, so I don't know if there is an "on" switch. I am exploring other possibilities to java VOSK, but it would be really ideal if I was able to get this working. Do you have any ideas for alternatives to this solution or ways to make this work? – Squalmals Jul 18 '21 at 21:35
  • You settled on VOSK? Did you try Sphinx? I'm no expert, but it is the more established solution. But if you settled on something that has no documentation, maybe that suits you. – gpasch Jul 18 '21 at 23:30
  • Good to have those cases established as a baseline. The successful .wav playback suggests the idea of an "unflipped on switch" is not the issue. Sorry I don't have much in the way of concrete ideas. The main thing I'm thinking of is to try using the library in source form, which would allow you to enter some debug points into their code, and perhaps you could then verify the path of the data through their library and discover the point where it disappears. But that only works to the extent that operations are exposed and being handled in Java, and not sent to another system. – Phil Freihofner Jul 18 '21 at 23:53

2 Answers


This code works correctly for me; you can use it:

    public static void main(String[] args) {
    
    LibVosk.setLogLevel(LogLevel.DEBUG);
    
    AudioFormat format = new AudioFormat(AudioFormat.Encoding.PCM_SIGNED, 60000, 16, 2, 4, 44100, false);
    DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
    TargetDataLine microphone;
    SourceDataLine speakers;

    try (Model model = new Model("model");
         Recognizer recognizer = new Recognizer(model, 120000)) {
        try {

            microphone = (TargetDataLine) AudioSystem.getLine(info);
            microphone.open(format);
            microphone.start();

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int numBytesRead;
            int CHUNK_SIZE = 1024;
            int bytesRead = 0;

            DataLine.Info dataLineInfo = new DataLine.Info(SourceDataLine.class, format);
            speakers = (SourceDataLine) AudioSystem.getLine(dataLineInfo);
            speakers.open(format);
            speakers.start();
            byte[] b = new byte[4096];

            while (bytesRead <= 100000000) {
                numBytesRead = microphone.read(b, 0, CHUNK_SIZE);
                bytesRead += numBytesRead;

                out.write(b, 0, numBytesRead);

                speakers.write(b, 0, numBytesRead);

                if (recognizer.acceptWaveForm(b, numBytesRead)) {
                    System.out.println(recognizer.getResult());
                } else {
                    System.out.println(recognizer.getPartialResult());
                }
            }
            System.out.println(recognizer.getFinalResult());
            speakers.drain();
            speakers.close();
            microphone.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
user11370465

I couldn't get either of the code snippets from Squalmals or user11370465 to work. The Vosk installation documentation says:

When using your own audio file make sure it has the correct format - PCM 16khz 16bit mono.
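As a rough sketch of what that requirement looks like in `javax.sound.sampled` terms, the helper below builds a 16 kHz, 16-bit, mono, signed PCM format and checks an arbitrary `AudioFormat` against it. The class and method names (`VoskFormatCheck`, `voskFormat`, `matchesVoskFormat`) are made up for illustration. Treating little-endian byte order as required is an assumption based on the WAV convention; the quoted sentence does not state it.

```java
import javax.sound.sampled.AudioFormat;

public class VoskFormatCheck {

    // 16 kHz, 16-bit, mono, signed PCM.
    // Little-endian is an assumption (it is the byte order WAV files use).
    public static AudioFormat voskFormat() {
        return new AudioFormat(16000f, 16, 1, true, false);
    }

    // Returns true only if every property matches the format above.
    public static boolean matchesVoskFormat(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getSampleRate() == 16000f
                && f.getSampleSizeInBits() == 16
                && f.getChannels() == 1
                && !f.isBigEndian();
    }

    public static void main(String[] args) {
        // The format from the question: 8 kHz and big-endian.
        AudioFormat fromQuestion = new AudioFormat(8000.0f, 16, 1, true, true);
        // Same format with only the sample rate raised to 16 kHz.
        AudioFormat rateFixedOnly = new AudioFormat(16000f, 16, 1, true, true);

        System.out.println(matchesVoskFormat(voskFormat()));   // true
        System.out.println(matchesVoskFormat(fromQuestion));   // false
        System.out.println(matchesVoskFormat(rateFixedOnly));  // false
    }
}
```

If the little-endian assumption holds, the last case may explain why raising the sample rate alone did not help: the format in the question also passes `true` for `bigEndian`.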

The following code works on my system, Linux Mint 20, OpenJDK 11.

public static void main(String[] argv) throws Exception {
    LibVosk.setLogLevel(LogLevel.DEBUG);

    // 16 kHz, 16-bit, mono, signed, little-endian PCM
    AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
    DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

    // try-with-resources closes the model, recognizer and microphone line
    try (Model model = new Model("my-path/vosk-model-small-en-us-0.15");
         Recognizer recognizer = new Recognizer(model, 16000);
         TargetDataLine microphone = (TargetDataLine) AudioSystem.getLine(info)) {

        microphone.open(format);
        microphone.start();

        int CHUNK_SIZE = 4096;
        byte[] b = new byte[CHUNK_SIZE];
        int bytesRead = 0;

        // Stop after roughly 100 MB of captured audio
        while (bytesRead <= 100000000) {
            int numBytesRead = microphone.read(b, 0, CHUNK_SIZE);
            bytesRead += numBytesRead;

            // Feed the chunk to the recognizer and print what it has so far
            if (recognizer.acceptWaveForm(b, numBytesRead)) {
                System.out.println(recognizer.getResult());
            } else {
                System.out.println(recognizer.getPartialResult());
            }
        }

        System.out.println(recognizer.getFinalResult());
    }
}
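The loop above prints every result, including empty partial ones, so the console fills up quickly. Below is a minimal sketch of a filter, assuming the recognizer returns small JSON strings with a `text` field (final results) or a `partial` field (partial results). The class name `VoskResultText` and the sample strings are hand-written for illustration, not captured output. A real JSON parser would be more robust than this regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VoskResultText {

    // Matches "text" : "..." or "partial" : "..." inside the result string.
    private static final Pattern FIELD =
            Pattern.compile("\"(text|partial)\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the recognized words, or an empty string if there are none.
    public static String extract(String json) {
        Matcher m = FIELD.matcher(json);
        return m.find() ? m.group(2).trim() : "";
    }

    public static void main(String[] args) {
        // Hand-written samples in the assumed result shape.
        String[] samples = {
            "{\n  \"partial\" : \"\"\n}",
            "{\n  \"partial\" : \"hello\"\n}",
            "{\n  \"text\" : \"hello world\"\n}"
        };
        for (String s : samples) {
            String words = extract(s);
            if (!words.isEmpty()) {
                System.out.println(words); // skips the empty partial result
            }
        }
    }
}
```

In the microphone loop, the same idea would be to call `extract(recognizer.getPartialResult())` and print only when the returned string is not empty.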

Also, the JNA Vosk wrapper didn't work right out of the box for me. In LibVosk.java I had to change

Native.register(LibVosk.class, "vosk")

to

Native.register(LibVosk.class, "my-path/lib/python3.8/site-packages/vosk/libvosk.so");
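A possible alternative to editing `LibVosk.java` is to tell JNA where to look through its `jna.library.path` system property. The sketch below only manipulates that property; the class name `VoskLibraryPath` is hypothetical, and the directory is the same placeholder path as above. I have not verified this against the Vosk wrapper, so treat it as an assumption. It would also have to run before any Vosk class is loaded.

```java
import java.io.File;

public class VoskLibraryPath {

    // Appends a directory to JNA's native library search path.
    // JNA reads the "jna.library.path" system property when it loads a library.
    public static String addToJnaPath(String dir) {
        String current = System.getProperty("jna.library.path");
        String updated = (current == null || current.isEmpty())
                ? dir
                : current + File.pathSeparator + dir;
        System.setProperty("jna.library.path", updated);
        return updated;
    }

    public static void main(String[] args) {
        // Placeholder directory that is assumed to contain libvosk.so.
        System.out.println(addToJnaPath("my-path/lib/python3.8/site-packages/vosk"));
    }
}
```

The same effect can be had without code by passing `-Djna.library.path=...` on the `java` command line.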

In general, the Vosk speech recognition toolkit seems to work very well compared to the other offline speech recognition tools I have tried. (I haven't tried CMUSphinx yet.) Vosk really needs better documentation and/or code comments, though.

orjander