
I would like to programmatically check whether a string can be pronounced or needs to be spelled out.

For example, internationalization can be read out, but i18n cannot, nor can hhdirgxzf.

I can think of some simple heuristics such as checking whether the string contains non-alpha characters, but I hope there is a more robust and scientific way to do it. Are there algorithmic approaches that can score a string based on how easy it is to pronounce?

Related: "Is there a way to rank the difficulty of pronunciation of a word?" However, I don't have a list and I can't precompute.


Update based on comments.

  • As I'm an English speaker, I'm interested in English, but I could imagine an algorithm based on the way sound and speech work rather than on the characteristics of a particular language.
  • By pronounced I mean the string can be read out naturally. It's possible to pronounce hhdirgxzf, but it would not sound like one natural-language word; it would need to be broken up.
  • A specific use case I have in mind is where I am sent strings and I want to use a basic text-to-speech system to read them out loud. I want to determine which tokens in the string to let the TTS system try to pronounce and which to make it spell out, erring on the side of spelling out if not confident.
brabster
  • Pronounced by who? Mandarin speakers? Swedish speakers? English speakers? Everyone? – Emil Vikström Aug 29 '12 at 10:04
  • I have no idea if it will work, but I'd try to extract features from the data (vowels locations, consonants in a row,...), and use some classification algorithm after manually labeling a set of samples. (Never tried it so I have no idea if it'll give good results) – amit Aug 29 '12 at 10:05
  • I can pronounce `i18n`, something like `eye-ate-een-en`. Your other example is a bit more of a challenge but I'll give it a go ... – High Performance Mark Aug 29 '12 at 10:07
  • Those can be pronounced. `i18n` -> `eye-eighteen-en`, and `hhdirgxzf` -> `hud-er-gux-zuf`. – aroth Aug 29 '12 at 10:07
  • @aroth: I think the second example is closer to `hu-hu-der-gez-zof` – High Performance Mark Aug 29 '12 at 10:08
  • Pronounceability might be something that TTS (Text To Speech) engines could offer an opinion on, since they will have had to do the hard work of syllabification anyway. Doing this yourself would be a sizeable task - have fun with eg "syzygy", "strength", "Knightsbridge"... – AakashM Aug 29 '12 at 10:11
  • "Scone" can be pronounced, but many people do so *incorrectly*. "Realise" can be read out, but in dictation should probably be spelled out anyway because British English has two spellings of the word and this is the British-only version. Likewise my name, "Stephen" can be pronounced but often must be partly spelled out. – Steve Jessop Aug 29 '12 at 10:12
  • Btw, interestingly it turns out that "the way sound and speaking works" depends in part on language. Speech processing by the brain is in part "programmed" by the phonemes you hear. So some (not all) East Asians have difficulty distinguishing the English "l" and "r" sounds. Many Westerners can't distinguish the South Asian soft "d" sound from hard "th", and most can't pronounce it. I have a lisp in some languages because I can't roll my "r"s, so there are normal French words I can't pronounce properly and Spanish is a nightmare, but I can hear the difference fine. And so forth. – Steve Jessop Aug 29 '12 at 10:23
  • Check your words against an English dictionary file – Nicolas Repiquet Aug 29 '12 at 10:32
  • @aroth I think High Performance Mark has given himself away as being from inner rather than outer Qwghlm – Pete Kirkham Aug 29 '12 at 20:50

3 Answers


You might have some success by first splitting the word into syllables. This question on SO might help. Of course, this will only work for languages that, like English, are written in an alphabet whose letters include vowels.
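
A minimal sketch of the syllable idea, not taken from the linked question: it chunks a token into rough consonant-plus-vowel groups and rejects tokens with no vowels or with long consonant runs. The function names, the choice to treat `y` as a vowel, and the `max_consonant_run` threshold of 3 are all assumptions made for illustration.

```python
import re

# Assumption: 'y' counts as a vowel so that words like "syzygy" can be chunked.
CHUNK = re.compile(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]+$)?")
CONSONANT_RUN = re.compile(r"[^aeiouy]+")


def rough_syllables(word):
    """Split a word into crude chunks: leading consonants plus a vowel group.

    Trailing consonants at the end of the word attach to the last chunk.
    This is a spelling-based approximation, not real syllabification.
    """
    return CHUNK.findall(word.lower())


def looks_pronounceable(token, max_consonant_run=3):
    """Return True only if the token passes every crude check."""
    if not (token.isascii() and token.isalpha()):
        return False  # digits or punctuation: spell it out
    word = token.lower()
    if "".join(rough_syllables(word)) != word:
        return False  # no vowel anywhere, so nothing could be chunked
    # Long consonant clusters are treated as a sign of an unpronounceable token.
    return all(len(run) <= max_consonant_run
               for run in CONSONANT_RUN.findall(word))


for token in ["internationalization", "i18n", "hhdirgxzf"]:
    print(token, looks_pronounceable(token))
```

With this threshold the check errs on the side of spelling out: real words with dense clusters, such as "strength" or "Knightsbridge", are rejected as well.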

High Performance Mark

Maybe count the alpha characters and divide that count by the length of the string, then score based on the density of alpha characters? Also, maybe decrease the score for each digit?
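
One possible reading of this scoring idea, as a sketch: the function name and the `digit_penalty` value of 0.25 are assumptions, and the score is clamped at zero so it stays in the range 0 to 1.

```python
def alpha_density_score(token, digit_penalty=0.25):
    """Score a token by its share of alphabetic characters.

    Each digit subtracts `digit_penalty` (an arbitrary illustrative value).
    """
    if not token:
        return 0.0
    alpha = sum(ch.isalpha() for ch in token)
    digits = sum(ch.isdigit() for ch in token)
    score = alpha / len(token) - digit_penalty * digits
    return max(0.0, score)  # clamp so the score never goes negative


for token in ["internationalization", "i18n", "hhdirgxzf"]:
    print(token, alpha_density_score(token))
```

Note that an all-letter token such as hhdirgxzf still gets the maximum score, so this measure alone cannot separate it from a real word.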

Jacob Lauritzen

What is the source of these strings? If you are generating them yourself, then you could try to generate likely pronounceable strings. Ideas that might work include:

  • start with a word and replace vowels with other vowels and consonants with similar consonants.

  • generate a random Soundex and work backwards to a word that generates that Soundex.

  • concatenate three or four pronounceable syllables.

  • alternate consonants and vowels.

  • Lorem Ipsum
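
As an illustration of the "concatenate syllables" and "alternate consonants and vowels" ideas, here is a small sketch. The letter sets and function names are assumptions chosen for the example, and it only applies if you control how the strings are generated.

```python
import random

# Assumed letter sets; a few awkward consonants (q, x, y) are left out on purpose.
CONSONANTS = "bcdfghjklmnprstvwz"
VOWELS = "aeiou"


def random_syllable(rng):
    """Build one consonant-vowel syllable such as 'ba' or 'ko'."""
    return rng.choice(CONSONANTS) + rng.choice(VOWELS)


def pronounceable_string(syllables=3, rng=None):
    """Concatenate consonant-vowel syllables, so letters strictly alternate."""
    if rng is None:
        rng = random.Random()
    return "".join(random_syllable(rng) for _ in range(syllables))


print(pronounceable_string(4))
```

Passing a seeded `random.Random` instance makes the output repeatable, which helps when testing.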

rossum
  • Actually, the strings are sent to me, and I've assumed I'll need to tokenize before doing anything. I have no idea what will be in there and need to try and work out whether a text-to-speech engine will be able to pronounce each 'word' or not. – brabster Aug 29 '12 at 14:29