
I'm trying to build Latin speech recognition, for which I'll need not word recognition but phonetic vowel-and-consonant recognition: Latin has only 40 sounds, but over 40,000 words × ~60 average endings ≈ 2.5 MILLION word-forms. The problem is that both the Web Speech API and Google Cloud Speech only start you off with supposedly similar-sounding complete words (and from an English grammar, too, since there are no 2.5-million-word Latin grammars out there), so there's no way for me to get down to processing the actual phonetic sounds, in particular just the WORD-STEM (the first half of the word), which distinguishes each word, as opposed to the word-ending, which (uselessly, for my purposes) only tells how the word functions in the sentence. Ideally, I'd want a grammar of word-stems such as

  • "am-" (short for amo, amare, amavi, amatus, etc.),
  • "vid-" (short for video, videre, vidi, visus, etc.),
  • "laet-" (short for laetus, laeta, laetum, etc.)

  • etc.

But speech-recognition technology can't search for that.
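For illustration, here's the kind of stem matching I'd want to run on whatever the recognizer gives back: rank candidate stems by edit distance against the opening letters of the heard word. This is only a sketch; the three-stem list is a placeholder, and a real version would need thousands of stems and a tuned distance cutoff.

```javascript
// Hypothetical post-processing: compare the FIRST HALF of a recognized
// word against a list of Latin stems, ignoring the ending.
var stems = ['am', 'vid', 'laet']; // placeholder stem list

// Plain Levenshtein edit distance between two strings.
function levenshtein(a, b) {
  var d = [];
  for (var i = 0; i <= a.length; i++) d[i] = [i];
  for (var j = 0; j <= b.length; j++) d[0][j] = j;
  for (var i = 1; i <= a.length; i++) {
    for (var j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                                   // deletion
        d[i][j - 1] + 1,                                   // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return d[a.length][b.length];
}

// Rank stems by edit distance to the word's opening letters.
function matchStem(word, stems) {
  var best = null, bestDist = Infinity;
  stems.forEach(function (stem) {
    var dist = levenshtein(word.slice(0, stem.length), stem);
    if (dist < bestDist) { bestDist = dist; best = stem; }
  });
  return best;
}

matchStem('amavit', stems);   // → 'am'
matchStem('videbant', stems); // → 'vid'
```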
So where can I get phonetic speech recognition?

I'd prefer JS, PHP, or Node, and preferably client-side rather than streaming.

Here's my code so far, for the Web Speech API. The key thing is the `console.log()`s, which show me trying to dig into each returned candidate word's properties:

speech.onresult = function (event) {
    var interim_transcript = '';
    var final_transcript = '';

    for (var i = event.resultIndex; i < event.results.length; ++i) {
        if (event.results[i].isFinal) {
            final_transcript += event.results[i][0].transcript;

            // Logs all the word-guess alternatives for this result:
            console.log(event.results[i]);

            // Dig into each alternative's properties. Alas, the only
            // returned properties are:
            //   (1) transcript  - the guessed word,
            //   (2) confidence  - the 0-to-1 likelihood of it being that word,
            //   (3) the prototype.
            for (var a in event.results[i]) {
                for (var b in event.results[i][a]) {
                    console.log("%c Poss-" + a + " %c " + b + ": " + event.results[i][a][b],
                        'background-color: black; color: yellow; font-size: 14px;',
                        'background-color: black; color: red; font-size: 14px;');
                }
            }
        } else {
            // Without this branch, interim_transcript stayed empty forever.
            interim_transcript += event.results[i][0].transcript;
        }
    }
    if (action == "start") {
        transcription.value += final_transcript;
        interim_span.innerHTML = interim_transcript;
    }
};
rudminda
  • You can build the dictionary yourself. What is the expected transcript of "phonetic-vowel-and-consonant-recognition"? – guest271314 Nov 05 '17 at 00:26
  • I can't build a dictionary of 2.5 million possible words. A word-tree-structure might work, but the available technologies aren't designed to recognize HALF a word (just the root, not the ending). – rudminda Nov 05 '17 at 00:29
  • _"I can't build a dictionary of 2.5 million possible words."_ ? Why not? _"but the available technologies aren't designed to recognize HALF a word (just the root, not the ending)."_ What do you mean by "recognize"? Again, you can create a grammar list yourself – guest271314 Nov 05 '17 at 00:30
  • Isn't 2.5 million too many for the speech-recognizer to search thru within a quarter-second? By "recognize" I mean 'consider-as-a-possibility.' Speech-recognizers use the Levenshtein algorithm to rank word-candidates based on percentage-likelihood of being the sound they heard. But to rank word-ROOTS (again the 1st half of the word), to a dictionary of possible word-roots, they wouldn't know where to break the sound they heard. – rudminda Nov 05 '17 at 00:34
  • _"within a quarter-second"_ How is time relevant to the inquiry at original Question? – guest271314 Nov 05 '17 at 00:40
  • You define the algorithm. Not certain what issue is? Are you trying to disprove the possibility of you creating your own grammar list and algorithm for recognition of a variable value? – guest271314 Nov 05 '17 at 00:42
  • No, I can't get in to tweak the Levenshtein algorithm. That's buried somewhere within the Web Speech API and Google Cloud Speech main code. – rudminda Nov 05 '17 at 00:45
  • Time is relevant, because this is for a live-chat app. – rudminda Nov 05 '17 at 00:46
  • _"No, I can't get in to tweak the Levenshtein algorithm. That's buried somewhere within the Web Speech API and Google Cloud Speech main code."_ Yes, you can. Else, why did you pose the Question at SO? _"Time is relevant, because this is for a live-chat app."_ What are your current benchmarks for the code that you have tried? What is expected result? See https://stackoverflow.com/help/how-to-ask, https://stackoverflow.com/help/mcve – guest271314 Nov 05 '17 at 00:50
  • _"Where can I get"_ Can you include the code that you have tried to resolve inquiry at Question? – guest271314 Nov 05 '17 at 01:01
  • You can set `.interimResults` to `true`, `.maxAlternatives` to a number greater than `1` and continuously evaluate `.results` `.transcript` using a fuzzy logic algorithm to match the specific phonetics that you associate with a given root. You can also use an analyzer to record and chart the specific ranges of the sounds of the portions of the portions of words expected. – guest271314 Nov 05 '17 at 02:16
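A minimal sketch of what that last comment suggests, assuming `.maxAlternatives` is raised so the recognizer returns several candidate transcripts per result. The stem list is a placeholder, and the `SpeechRecognition` wiring is browser-only, so it is shown commented out:

```javascript
// Sketch: ask the recognizer for several alternatives, then try to match
// the START of each transcript against a list of known Latin stems.
var latinStems = ['am', 'vid', 'laet']; // placeholder stem list

// Return the first known stem that begins any alternative's transcript.
function stemFromAlternatives(alternatives, stems) {
  for (var i = 0; i < alternatives.length; i++) {
    var word = alternatives[i].transcript.trim().toLowerCase();
    for (var j = 0; j < stems.length; j++) {
      if (word.indexOf(stems[j]) === 0) return stems[j];
    }
  }
  return null; // no stem matched any alternative
}

/* Browser-only wiring (Chrome exposes webkitSpeechRecognition):
var recognition = new webkitSpeechRecognition();
recognition.interimResults = true;
recognition.maxAlternatives = 5;
recognition.onresult = function (event) {
  var result = event.results[event.resultIndex];
  console.log(stemFromAlternatives(Array.prototype.slice.call(result), latinStems));
};
*/
```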

1 Answer


You can create a SpeechGrammarList. See also the JSpeech Grammar Format.

Example description and code from MDN:

The SpeechGrammarList interface of the Web Speech API represents a list of SpeechGrammar objects containing words or patterns of words that we want the recognition service to recognize.

Grammar is defined using JSpeech Grammar Format (JSGF). Other formats may also be supported in the future.

var grammar = '#JSGF V1.0; grammar colors; public <color> = aqua | azure | beige | bisque | black | blue | brown | chocolate | coral | crimson | cyan | fuchsia | ghostwhite | gold | goldenrod | gray | green | indigo | ivory | khaki | lavender | lime | linen | magenta | maroon | moccasin | navy | olive | orange | orchid | peru | pink | plum | purple | red | salmon | sienna | silver | snow | tan | teal | thistle | tomato | turquoise | violet | white | yellow ;';
var recognition = new SpeechRecognition();
var speechRecognitionList = new SpeechGrammarList();
speechRecognitionList.addFromString(grammar, 1);
recognition.grammars = speechRecognitionList;
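Applied to the question's word-stems, the grammar string could be composed from a stem array. Whether the recognition service will actually match a partial word against these entries is untested, so treat this as a sketch:

```javascript
// Sketch: compose a JSGF grammar from Latin word-stems instead of
// complete words. The stem list is a placeholder from the question.
var latinStemList = ['am', 'vid', 'laet'];

// Build the one-rule JSGF grammar string the Web Speech API expects.
function stemsToJSGF(stems) {
  return '#JSGF V1.0; grammar stems; public <stem> = ' + stems.join(' | ') + ' ;';
}

var stemGrammar = stemsToJSGF(latinStemList);
// '#JSGF V1.0; grammar stems; public <stem> = am | vid | laet ;'

/* Browser-only wiring:
var recognition = new SpeechRecognition();
var list = new SpeechGrammarList();
list.addFromString(stemGrammar, 1);
recognition.grammars = list;
*/
```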
guest271314
  • I already know about Speech Grammar Lists. Again, the problem is that a grammar list is made of complete words, not the first half of each word. However, . . . it does occur to me now, that maybe the speech-recognizer might do just fine with half a word. I'll have to test.... – rudminda Nov 05 '17 at 00:42
  • @rudminda Not sure what the issue is as to the concept of you creating your own algorithm for filtering the captured audio? For example, `ls` is a command, not necessarily a word in a common dictionary, though the audio is captured by the service and the transcript is `"LS"` see [How can I extract the preceding audio (from microphone) as a buffer when silence is detected (JS)?](https://stackoverflow.com/questions/46543341/how-can-i-extract-the-preceding-audio-from-microphone-as-a-buffer-when-silence). What you do or do not do with the result is a different inquiry. – guest271314 Nov 05 '17 at 00:45
  • After the computer captures the audio, it STARTS me with one or several supposedly-similar-sounding English words, but often nothing like the original spoken Latin word. If this is what I'm starting from, then I have no hope of reliably filtering it. – rudminda Nov 05 '17 at 00:49
  • @rudminda You have the ability to define the grammar. Have you tried to compose the grammar list that you are describing? What is the actual problem that you are trying to solve? – guest271314 Nov 05 '17 at 00:52
  • The Actual Problem: To get speech-recognition to search a grammar of word-stems rather than complete words. (You can't just type the Latin-stem "for" into the grammar, and hope that the recognizer will connect the spoken sound "/fororum/" ["of the forums"] to the desired singular Latin dictionary-entry "forum", rather than to the other undesired Latin dictionary-entries "for" [to speak], "foras" [outside], "foris" [doors], or "fore" [going to be].) – rudminda Nov 05 '17 at 01:31
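To make that ambiguity concrete, here's a toy lookup (lemma glosses taken from the comment above): a bare stem maps to several dictionary entries at once, so stem matching alone can't pick one.

```javascript
// Toy illustration of the ambiguity described above: the stem "for-"
// is shared by several distinct Latin dictionary entries.
var stemToLemmas = {
  'for': ['forum (forum)', 'for (to speak)', 'foras (outside)',
          'foris (doors)', 'fore (going to be)']
};

// Return every lemma whose stem begins the spoken form.
function lemmaCandidates(spoken, map) {
  var hits = [];
  for (var stem in map) {
    if (spoken.indexOf(stem) === 0) hits = hits.concat(map[stem]);
  }
  return hits;
}

lemmaCandidates('fororum', stemToLemmas); // five candidates, still ambiguous
```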