0

I'm working on an English lessons app, and one feature I'm trying to implement is "Listen And Repeat" lessons. For that purpose I'm using Android's SpeechRecognizer, but I'm not getting the expected results.

The following is my "Speech Recognition" activity:

class MainActivity : AppCompatActivity(), RecognitionListener {
   private val permission = 100
   private lateinit var returnedText: TextView
   private lateinit var toggleButton: ToggleButton
   private lateinit var progressBar: ProgressBar
   private lateinit var speech: SpeechRecognizer
   private lateinit var recognizerIntent: Intent
   private var logTag = "VoiceRecognitionActivity"
   override fun onCreate(savedInstanceState: Bundle?) {
      super.onCreate(savedInstanceState)
      setContentView(R.layout.activity_main)
      title = "KotlinApp"
      returnedText = findViewById(R.id.textView)
      progressBar = findViewById(R.id.progressBar)
      toggleButton = findViewById(R.id.toggleButton)
      progressBar.visibility = View.VISIBLE
      speech = SpeechRecognizer.createSpeechRecognizer(this)
      Log.i(logTag, "isRecognitionAvailable: " + SpeechRecognizer.isRecognitionAvailable(this))
      speech.setRecognitionListener(this)
      recognizerIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
      recognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_PREFERENCE, "en-US") //also tried "en_US" or "en-UK" and also tried Locale.US.toString(). Also replaced EXTRA_LANGUAGE_PREFERENCE with EXTRA_LANGUAGE
      recognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
      RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
      recognizerIntent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 3)
      toggleButton.setOnCheckedChangeListener { _, isChecked ->
         if (isChecked) {
            progressBar.visibility = View.VISIBLE
            progressBar.isIndeterminate = true
            ActivityCompat.requestPermissions(this@MainActivity,
            arrayOf(Manifest.permission.RECORD_AUDIO),
            permission)
         } else {
            progressBar.isIndeterminate = false
            progressBar.visibility = View.VISIBLE
            speech.stopListening()
         }
      }
   }
   override fun onRequestPermissionsResult(requestCode: Int, permissions: Array<out String>,
   grantResults: IntArray) {
      super.onRequestPermissionsResult(requestCode, permissions, grantResults)
      when (requestCode) {
         permission -> if (grantResults.isNotEmpty() && grantResults[0] == PackageManager
         .PERMISSION_GRANTED) {
            speech.startListening(recognizerIntent)
         } else {
            Toast.makeText(this@MainActivity, "Permission Denied!",
            Toast.LENGTH_SHORT).show()
         }
      }
   }
   override fun onStop() {
      super.onStop()
      speech.destroy()
      Log.i(logTag, "destroy")
   }
   override fun onReadyForSpeech(params: Bundle?) {
      // Leaving TODO("Not yet implemented") here would throw NotImplementedError
      // as soon as listening starts, since this callback is always invoked.
      Log.i(logTag, "onReadyForSpeech")
   }
   override fun onRmsChanged(rmsdB: Float) {
      progressBar.progress = rmsdB.toInt()
   }
   override fun onBufferReceived(buffer: ByteArray?) {
      // No-op: a TODO() body would crash the app when the recognizer calls back
   }
   override fun onPartialResults(partialResults: Bundle?) {
      // No-op
   }
   override fun onEvent(eventType: Int, params: Bundle?) {
      // No-op
   }
   override fun onBeginningOfSpeech() {
      Log.i(logTag, "onBeginningOfSpeech")
      progressBar.isIndeterminate = false
      progressBar.max = 10
   }
   override fun onEndOfSpeech() {
      progressBar.isIndeterminate = true
      toggleButton.isChecked = false
   }
   override fun onError(error: Int) {
      val errorMessage: String = getErrorText(error)
      Log.d(logTag, "FAILED $errorMessage")
      returnedText.text = errorMessage
      toggleButton.isChecked = false
   }
   private fun getErrorText(error: Int): String = when (error) {
      SpeechRecognizer.ERROR_AUDIO -> "Audio recording error"
      SpeechRecognizer.ERROR_CLIENT -> "Client side error"
      SpeechRecognizer.ERROR_INSUFFICIENT_PERMISSIONS -> "Insufficient permissions"
      SpeechRecognizer.ERROR_NETWORK -> "Network error"
      SpeechRecognizer.ERROR_NETWORK_TIMEOUT -> "Network timeout"
      SpeechRecognizer.ERROR_NO_MATCH -> "No match"
      SpeechRecognizer.ERROR_RECOGNIZER_BUSY -> "RecognitionService busy"
      SpeechRecognizer.ERROR_SERVER -> "Error from server"
      SpeechRecognizer.ERROR_SPEECH_TIMEOUT -> "No speech input"
      else -> "Didn't understand, please try again."
   }
   override fun onResults(results: Bundle?) {
      Log.i(logTag, "onResults")
      val matches = results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION) ?: return
      // Show every candidate, one per line; the original loop overwrote the text
      // on each iteration, keeping only the last (lowest-ranked) match.
      returnedText.text = matches.joinToString(separator = "\n")
   }
}

The microphone permissions are being requested correctly and the recognition engine is working, but the results are not satisfactory. I'll give two examples:

  1. I say "This is my bedroom" > it recognizes "This is my bedrun" (among others).
  2. I say "The soldier was killed during the war" > it recognizes (in the best of cases) "The souldier was killed during the wall/world/some others".

I know the recognition engine is not perfect and of course every voice is different, but here are some observations:

  1. In Spanish (my phone's default language) recognition is far more precise than in English.
  2. If I set my phone's language to English (in Settings) recognition improves a little, but not by much.
  3. I've noticed that Google Chrome's speech recognition is far more precise than Android's (why?)

In the end, given my activity code, do you know of any way I could improve English speech recognition in my app?

Diego Perez

3 Answers

2

Speech recognition is not a precise science; it's tech that's still actively being developed. Also, if your native language is Spanish, you may have an accent that makes English recognition more difficult (it doesn't even need to be an accent that's hard for people to understand, just enough that the phonemes shift), while in Spanish that isn't a problem because yours is a normal Spanish-speaking accent. Your only real options are to try to speak more clearly, or to try another speech-to-text engine. There are plenty out there, but most are not free. Also, switching engines may make it better for you but not for others, so do your research if you go that route.

Also, if you decide to work on this, be scientific about it. Don't say the same phrase to each engine: record it once and use the recording. It's easy to bias tests like this. For example, setting your phone language almost certainly has zero effect, but you can easily get confirmation bias.

Gabe Sechan
2

Adding to Gabe's answer, if this is an app aimed at English-language students you might want to bake a bit of leeway into your system and allow for a certain level of inaccuracy, y'know? Learners by definition are people whose knowledge and skills aren't perfect, so you'd expect their pronunciation to differ from a native speaker's, which speech recognition will typically pick up on.

And this can be a problem for native speakers too! There are so many English accents, and for a lot of people recognition software will often "mishear" certain words. It's not perfect, and it's not really capable of evaluating people's speaking accuracy, or how well they adopt a particular accent. So a bit of wiggle room in interpreting the results might be helpful, rather than expecting users to match a standard many native speakers can't meet consistently, if you see what I mean.

For example, some people pronounce bedroom as bedrum in British English (I don't though!) so for those people, your example of bedrun is actually very close! Not a lot of difference between /m/ and /n/, especially in normal conversational speech. For me, that would get a pass. The bigger question, though, is why didn't the recognition software resolve that word to bedroom rather than whatever a bedrun is? It feels like it's targeting a specific accent or range of accents, probably because you're passing en-US as the target language and region/dialect.


So I guess what it comes down to is a question of what your goal is here. Do you want students to match what the recognition software is expecting? If so, then your pronunciations are "wrong" and you need to work on them (e.g. emphasising the oo in bedroom) until it says you're "correct". For a formal language drill this could be fine, but I suspect it would be pretty frustrating for many people who just want to practice, but who don't already sound like whatever native speaker the software is comparing them to.

But if you're just aiming for some level of "good enough", some amount of consistency etc, you'll need to allow for results that aren't 100% perfect matches. And of course this is the tricky part - what do you allow for? How do you go about scoring things? How do you weight parts of a sentence, e.g. unimportant stuff like the pronunciation of the vs the important words. How do you actually turn that into an algorithm in code?


This is probably a pretty complex problem, and I feel like a lot of similar apps err on the side of "good enough". I've used Duolingo and messed up pronunciations that the app just waves through, so even big apps like that aren't perfect. I think you'll need to try some stuff out, and do some experimentation - and really, the more people you can get involved testing your algorithm, the better. You can't rely on making it work for you, you know? That's fine for starting out, trying some stuff, but if you get into fine-tuning things you really need a wide base of training material (there are probably collections of audio samples out there though).

Anyhow, some ideas you could look into:

  • See if you can request other dialects instead of just en-US and maybe run the recognition against those too. Maybe combine the results and see if one matches, or automatically switch to just using the one that gives the best results for this user, or let them pick a region themselves. Remember the device locale isn't necessarily a good indicator of what a person sounds like!
  • Instead of trying to match strings, split the words and see how many of those match. Maybe weight certain words lower (e.g. with a dictionary of common connecting words like the, a, and etc.) if unimportant stuff is consistently causing match failures (which probably depends on the recogniser)
  • Curate a set of "questions" and common "answers" that students give, so you can treat certain pronunciations explicitly (like "this one is close enough")
  • Maybe see if you can find a dictionary of words and what they "sound similar to", like war and wall, and use that for weighting your scoring
  • If you want to get real fancy, see if you can find an IPA dictionary and compare words by phonemes instead
  • Check the confidence scores you get and maybe reject low ones ("please try again") or let them weight your overall score (e.g. instead of wanting 90% of the words correct, let that drop for a low confidence score)
  • Lower the matching score threshold depending on the learner's level. Let people have some wins and only care about perfection for higher-level students
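To make the word-splitting and weighting ideas above a bit more concrete, here's a minimal sketch in Java (the class name, the stop-word list, and the 0.25 weight are all invented for illustration, not part of any Android API). It strips punctuation, splits on whitespace, and scores a position-by-position match where common connecting words count less:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class WordMatchScorer {
    // Common connecting words that matter less for pronunciation practice
    private static final List<String> STOP_WORDS =
            Arrays.asList("the", "a", "an", "and", "of", "to", "is");

    // Weight per word: unimportant connecting words count less
    private static double weightOf(String word) {
        return STOP_WORDS.contains(word) ? 0.25 : 1.0;
    }

    // Strip punctuation, lower-case, and split on whitespace
    private static String[] tokenize(String sentence) {
        return sentence.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z' ]", "")
                .trim()
                .split("\\s+");
    }

    // Score = matched weight / total expected weight, comparing word by word
    public static double score(String expected, String heard) {
        String[] exp = tokenize(expected);
        String[] got = tokenize(heard);
        double total = 0.0, matched = 0.0;
        for (int i = 0; i < exp.length; i++) {
            double w = weightOf(exp[i]);
            total += w;
            if (i < got.length && exp[i].equals(got[i])) {
                matched += w;
            }
        }
        return total == 0 ? 0.0 : matched / total;
    }
}
```

With this, "This is my bedrun" against "This is my bedroom" still scores fairly high because only one content word failed, whereas a plain string comparison would simply fail. The positional matching is deliberately naive; aligning word lists properly (so one inserted word doesn't shift everything) is where it gets harder.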

Hope that gives you some ideas! Some are obviously more work than others. And sorry for the long answer, I just wanted to give a bigger overall perspective so you have a better idea of what you want to do. And it's probably worth looking around the internet to see what others have done - there's probably a ton of research in this area, with stuff that focuses on non-native speakers in particular. There might be some approaches that give you ideas or even a general algorithm you can implement.

Good luck!

cactustictacs
  • Thank you very much for such a complete and detailed answer @cactustictacs. I appreciate your time very much, and you were very helpful indeed. I'll accept your answer and post another answer with an idea I have so you can tell me what you think. – Diego Perez May 31 '23 at 19:17
0

I appreciate both answers very much, thanks.

I'm posting a new answer so you can tell me what you think about my idea.

I'm aware speech recognition is far from perfect, and of course I'm open to allowing a certain level of inaccuracy; in fact, that was one of my first ideas after my initial speech recognition attempts.

I thought I could listen to the user and then compare both strings using the method in the following post.

Similarity String Comparison in Java

I would establish an "error" acceptance percentage, let's say 25%, so a comparison that returns 75% similarity or more would be accepted, and the rest would be considered wrong.
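As a rough sketch of that idea (in Java, since the linked post is Java-based; the 75% threshold is just the figure suggested above, not anything standard), the check could look like this:

```java
public class SimilarityCheck {
    // Classic dynamic-programming Levenshtein edit distance
    private static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Similarity = 1 - (edit distance / length of the longer string)
    public static double similarity(String a, String b) {
        String x = a.toLowerCase(), y = b.toLowerCase();
        int longer = Math.max(x.length(), y.length());
        if (longer == 0) return 1.0;
        return 1.0 - (double) editDistance(x, y) / longer;
    }

    // Accept when similarity is at or above the chosen threshold (75% here)
    public static boolean accept(String expected, String heard) {
        return similarity(expected, heard) >= 0.75;
    }
}
```

For example, "This is my bedroom" vs "This is my bedrun" is 3 edits over 18 characters, roughly 83% similar, so it would pass the 75% bar, which matches the leeway I'm after.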

What do you think?

Diego Perez
  • 1
    So that algorithm is basically edit (Levenshtein) distance: measuring accuracy based on the number of changes it would take to turn one string into the other. This isn't a bad idea, but it can have false positives and negatives. The biggest problem is that it works on letters whereas you'd rather work on sounds. Cell would score 75% for the word sell, but they sound identical when spoken. their and they're would be similarly bad. Pot and opt are nothing alike in sound, but have a 1 edit difference (a swap of o and p). – Gabe Sechan May 31 '23 at 20:13
  • 1
    It's a "better than nothing" solution for sure. But really it would be nice if you were working at a lower level and working with sounds instead of letters. Then you could also score by how often the sounds are to a native speaker (for example, sss and shh sounds are closer to one another than sss and a hard k). But you're probably not going to get to do things like that unless you use an engine that allows you to go to a deeper level, which may not be something you have time/resources for. – Gabe Sechan May 31 '23 at 20:17
  • Yeah this doesn't really reflect how language works, and depending on the algorithm you use, a single changed word in two near-identical sentences could "shift" all the others to where their characters no longer match. Like Gabe says, sounds would be better (I mentioned a phoneme-based dictionary in my answer, English is really inconsistent so a reference like that would help) - but for a naive approach, just stripping punctuation, splitting the string on spaces, and working on individual words would be better I think. Compare the lists, look for word groups, work out which are "missing" etc – cactustictacs May 31 '23 at 21:09
  • OK guys, I think I got it. The Musicg Android library can, among other things, compare two audio files in .wav format. So I'll have to save both audios as wav files in the app cache directory and use Musicg to detect similarity. https://github.com/loisaidasam/musicg – Diego Perez May 31 '23 at 22:38