
I've revisited CMU Sphinx recently and attempted to set up a basic hot-word detector for Android, starting from the tutorial and adapting the sample application.

I'm having various issues that I've been unable to resolve, despite delving deep into the documentation until I could read no more...

In order to reproduce them, I made a basic project designed to detect the key-phrases wakeup you and wakeup me.

My dictionary:

me M IY
wakeup W EY K AH P
you Y UW
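
(As an aside, and nothing the files here depend on: the CMU dictionary format also accepts alternative pronunciations via a numbered suffix, should a word need more than one transcription. The second entry below is purely illustrative, not a vetted pronunciation:)

wakeup W EY K AH P
wakeup(2) W EY K UH P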

My language model:

\data\
ngram 1=5
ngram 2=5
ngram 3=4

\1-grams:
-0.9031 </s> -0.3010
-0.9031 <s> -0.2430
-1.2041 me -0.2430
-0.9031 wakeup -0.2430
-1.2041 you -0.2430

\2-grams:
-0.3010 <s> wakeup 0.0000
-0.3010 me </s> -0.3010
-0.6021 wakeup me 0.0000
-0.6021 wakeup you 0.0000
-0.3010 you </s> -0.3010

\3-grams:
-0.6021 <s> wakeup me
-0.6021 <s> wakeup you
-0.3010 wakeup me </s>
-0.3010 wakeup you </s>

\end\

Both of the above were created using the suggested tool.

And my key-phrases file:

wakeup you /1e-20/
wakeup me /1e-20/
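
(For context: the value between the slashes is a per-phrase detection threshold, and pocketsphinx accepts a different value on each line, so each phrase can be tuned separately against test recordings. The numbers below are illustrative placeholders rather than tuned values:)

wakeup you /1e-25/
wakeup me /1e-15/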

Here is my code, adapted from the example application linked above:

public class PocketSphinxActivity extends Activity implements RecognitionListener {

    private static final String CLS_NAME = PocketSphinxActivity.class.getSimpleName();

    private static final String HOTWORD_SEARCH = "hot_words";

    private volatile SpeechRecognizer recognizer;

    @Override
    public void onCreate(Bundle state) {
        super.onCreate(state);
        setContentView(R.layout.main);

        new AsyncTask<Void, Void, Exception>() {
            @Override
            protected Exception doInBackground(Void... params) {
                Log.i(CLS_NAME, "doInBackground");

                try {

                    final File assetsDir = new Assets(PocketSphinxActivity.this).syncAssets();

                    recognizer = defaultSetup()
                            .setAcousticModel(new File(assetsDir, "en-us-ptm"))
                            .setDictionary(new File(assetsDir, "basic.dic"))
                            .setKeywordThreshold(1e-20f)
                            .setBoolean("-allphone_ci", true)
                            .setFloat("-vad_threshold", 3.0)
                            .getRecognizer();

                    recognizer.addNgramSearch(HOTWORD_SEARCH, new File(assetsDir, "basic.lm"));
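                    // NB: the ngram search above and the keyword search below
                    // are registered under the same name (HOTWORD_SEARCH); only
                    // one search can be bound to a given name, so the language
                    // model ends up unused (see the answer below).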
                    recognizer.addKeywordSearch(HOTWORD_SEARCH, new File(assetsDir, "hotwords.txt"));
                    recognizer.addListener(PocketSphinxActivity.this);

                } catch (final IOException e) {
                    Log.e(CLS_NAME, "doInBackground IOException");
                    return e;
                }

                return null;
            }

            @Override
            protected void onPostExecute(final Exception e) {
                Log.i(CLS_NAME, "onPostExecute");

                if (e != null) {
                    e.printStackTrace();
                } else {
                    recognizer.startListening(HOTWORD_SEARCH);
                }
            }
        }.execute();
    }

    @Override
    public void onBeginningOfSpeech() {
        Log.i(CLS_NAME, "onBeginningOfSpeech");
    }

    @Override
    public void onPartialResult(final Hypothesis hypothesis) {
        Log.i(CLS_NAME, "onPartialResult");

        if (hypothesis == null)
            return;

        final String text = hypothesis.getHypstr();
        Log.i(CLS_NAME, "onPartialResult: text: " + text);

    }

    @Override
    public void onResult(final Hypothesis hypothesis) {
        // unused
        Log.i(CLS_NAME, "onResult");
    }

    @Override
    public void onEndOfSpeech() {
        // unused
        Log.i(CLS_NAME, "onEndOfSpeech");
    }


    @Override
    public void onError(final Exception e) {
        Log.e(CLS_NAME, "onError");
        e.printStackTrace();
    }

    @Override
    public void onTimeout() {
        Log.i(CLS_NAME, "onTimeout");
    }

    @Override
    public void onDestroy() {
        super.onDestroy();
        Log.i(CLS_NAME, "onDestroy");

        recognizer.cancel();
        recognizer.shutdown();
    }
}
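
One assumption the activity above makes is that the RECORD_AUDIO permission has already been granted; without it the recognizer cannot open the microphone. A minimal guard, assuming the v4 support library and Android 6.0+ runtime permissions (the request-code constant is my own invention), could be called before starting the AsyncTask:

import android.Manifest;
import android.content.pm.PackageManager;
import android.support.v4.app.ActivityCompat;
import android.support.v4.content.ContextCompat;

// Hypothetical request code, matched again in onRequestPermissionsResult().
private static final int PERMISSION_RECORD_AUDIO = 1;

private boolean ensureAudioPermission() {
    // Returns true if recording is already permitted; otherwise asks the
    // user and returns false so the recognizer setup can be deferred.
    if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
            == PackageManager.PERMISSION_GRANTED) {
        return true;
    }
    ActivityCompat.requestPermissions(this,
            new String[]{Manifest.permission.RECORD_AUDIO},
            PERMISSION_RECORD_AUDIO);
    return false;
}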

Note: if I alter my selected key-phrases (and the other related files) to be more dissimilar, and I test the implementation in a quiet environment, the setup and thresholds applied work very successfully.

Problems

  1. When I say either wakeup you or wakeup me, both will be detected.

I can't establish how to apply an increased weighting to the end syllables.

  2. When I say just wakeup, often (but not always) both will be detected.

I can't establish how to avoid this occurring.

  3. When testing against background noise, the false positives are too frequent.

I can't lower the base thresholds I am using; otherwise the key-phrases are not detected consistently under normal conditions.

  4. When testing against background noise for a long period (5 minutes should be sufficient to replicate), returning immediately to a quiet environment and uttering the key-phrases results in no detection.

It takes an undetermined period of time before the key-phrases are detected successfully and repeatedly, as though the test had begun in a quiet environment.

I found a potentially related question, but the links no longer work. I wonder if I should be resetting the recogniser more frequently, so as to stop the background noise being averaged into the detection thresholds? (A sketch of this idea follows the list below.)

  5. Finally, I wonder if my requirement for a limited set of key-phrases would allow me to reduce the size of the acoustic model?

Any reduction in the packaging overhead within my application would of course be beneficial.
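
Regarding the restart idea in problem 4, this is a minimal sketch of what I mean, assuming it were invoked periodically (say, from a Handler.postDelayed loop on the main thread); whether stopping and restarting the search actually discards the accumulated noise estimate is exactly what I am unsure about:

private void restartHotwordSearch() {
    // Cancel the in-progress search without delivering a result, then start
    // a fresh keyword search, in the hope of discarding whatever background
    // noise has been averaged into the recognizer's estimates (untested).
    recognizer.cancel();
    recognizer.startListening(HOTWORD_SEARCH);
}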

Very finally (honest!), and specifically hoping that @NikolayShmyrev will spot this question: are there any plans to wrap a base Android implementation/SDK entirely via Gradle?

My thanks to those who made it this far...

– brandall
1 Answer


My language model:

You do not need the language model, since you do not use it.
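
That is, since startListening(HOTWORD_SEARCH) ends up running the keyword search, the setup from the question can simply drop the ngram search, roughly:

recognizer = defaultSetup()
        .setAcousticModel(new File(assetsDir, "en-us-ptm"))
        .setDictionary(new File(assetsDir, "basic.dic"))
        .setKeywordThreshold(1e-20f)
        .setBoolean("-allphone_ci", true)
        .setFloat("-vad_threshold", 3.0)
        .getRecognizer();

// Only the keyword search is needed; no addNgramSearch, no basic.lm.
recognizer.addKeywordSearch(HOTWORD_SEARCH, new File(assetsDir, "hotwords.txt"));
recognizer.addListener(PocketSphinxActivity.this);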

I can't lower the base thresholds I am using; otherwise the key-phrases are not detected consistently under normal conditions.

1e-20 is a reasonable threshold; you could provide a sample recording where you get the false detections, to give me a better idea of what is going on.

When testing against background noise for a long period (5 minutes should be sufficient to replicate), returning immediately to a quiet environment and uttering the key-phrases results in no detection.

This is expected behavior. Overall, long background noise makes it harder for the recognizer to adapt quickly to the audio parameters. If your task is to spot words in a noisy place, it's better to use some kind of hardware noise cancellation, for example a Bluetooth headset with noise cancellation.

Finally, I wonder if my requirement for a limited set of key-phrases would allow me to reduce the size of the acoustic model?

It is not possible at the moment. If you are looking just for spotting, you can try https://snowboy.kitt.ai.

– Nikolay Shmyrev
  • Thanks for your response, Nikolay, and for the link to snowboy. I understand now, after some further reading, that the LM is not used. Do you have any suggestions as to how I can prevent both key-phrases being detected? I'll look to link an audio file where the false positives are occurring. – brandall Sep 04 '16 at 14:44
  • Make the key-phrases longer and distinctive enough and each will be reliably detected, given the threshold you specify. I don't quite get your use case where you want to use two similar phrases. – Nikolay Shmyrev Sep 04 '16 at 18:53
  • I thought if I started out making the phrases similar and perfecting the system, when it came to implementing the actual, more distinct phrases, it would be easier and I would be more knowledgeable. At least, that was my hope! – brandall Sep 04 '16 at 20:46
  • The detection is always a hard choice between a user saying a slightly different phrase and a different user saying the same phrase, just with an accent. For the first case you should not allow even slight variation in the acoustics; for the second case you need to relax the model so it will still accept the phrase. There might be better algorithms for spotting, to ensure that all the sounds of the key-phrase are clearly articulated, but that is not what is implemented in pocketsphinx; it just considers the whole phrase and tries to measure the differences. – Nikolay Shmyrev Sep 05 '16 at 11:29
  • Thanks Nikolay. One final question please. Does using `setKeywordThreshold(1e-20f)` override any thresholds that may be set higher in the keywords file? – brandall Sep 05 '16 at 11:51
  • No, the thresholds in the file take priority over the global threshold. – Nikolay Shmyrev Sep 05 '16 at 12:04