25

I'm looking for a fully accurate statement of an algorithm to count syllables in words. What I find when I research this is either inconsistent or something I know generates incorrect results. Does anyone have any suggestions for how to accomplish this? Thanks.

The algorithm I'm using now:

  1. Count the number of vowels in the word.
  2. Do not count double vowels ("rain" has 2 vowels but is only 1 syllable).
  3. If the last letter in the word is a vowel, do not count it ("side" is 1 syllable).

Are there any more rules I'm missing? While testing, I'm trying to determine whether my incorrect results come from the algorithm itself or from my implementation of it.
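For reference, the three rules listed above can be written down directly, which makes it easier to tell an algorithm problem from an implementation problem. The following is a minimal sketch in Python, not a recommended counter: the function name `count_syllables_naive` is made up for illustration, `y` is assumed not to be a vowel, and the floor of one syllable for non-empty words is an added assumption that the rules do not state.

```python
import re

VOWELS = "aeiou"  # assumption: 'y' is not treated as a vowel


def count_syllables_naive(word):
    """Apply the three rules from the question literally (hypothetical helper)."""
    w = word.lower()
    # Rule 3: a vowel in final position is not counted.
    if w and w[-1] in VOWELS:
        w = w[:-1]
    # Rules 1 and 2: count vowels, treating a run of adjacent vowels as one.
    groups = re.findall("[" + VOWELS + "]+", w)
    # Assumption (not in the rules): a non-empty word has at least one syllable.
    return max(1, len(groups)) if word else 0


for sample in ("rain", "side", "banana", "doable"):
    print(sample, count_syllables_naive(sample))
```

Under these assumptions the rules give the expected 1 for "rain" and "side", but 2 for "banana" (3 syllables) and 1 for "doable" (3 syllables), so the rules themselves can produce wrong counts independently of any implementation bug.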

durron597
Glenn1234
  • ad 2: "doable"? Ouch! – wildplasser Feb 01 '12 at 13:02
  • Because I'm dealing with readability formulas too, I was curious: did you decide on or find an efficient enough algorithm for this purpose? –  Aug 31 '12 at 17:23
  • Well, there are exception cases, but what does this have to do with programming? It is more a matter of research/algorithms and linguistics than of programming itself. – dhein Jul 15 '15 at 15:30
  • Related: https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word – Anton Tarasenko Oct 20 '17 at 17:46

5 Answers

27

Ambiguity is a huge issue in natural language processing, but some tasks can actually cope with the ambiguity and still reach good accuracy. It turns out syllabification is one of them, so don't listen to the other answers. :)

Syllabification

Heuristic-based

You could come up with hand-written rules that syllabify nearly the whole English vocabulary correctly, but they seem complicated to program correctly.

Corpus-based

As always, when hand-made algorithms don't help much, Natural Language Processing researchers use hand-tagged corpora containing the correct answers for given words. Learning algorithms are then trained on them and often provide great accuracy. You can use LingPipe's syllabification (see "English syllabification"), which follows this approach.

Exhaustive list

English only has so many words, which is how we came up with dictionaries. Such dictionaries often contain the correct syllabification. You could scrape reference.com. For example, the undulate entry contains « un·du·late », which is enough to know there are three syllables.
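As a rough illustration of the lookup idea, the sketch below counts syllables from a headword that is already syllabified with the middle dot, as in « un·du·late ». The table `SYLLABIFIED` and both function names are hypothetical, and the sketch assumes the data has already been obtained in this format; it does not show any retrieval from a dictionary site.

```python
# Hypothetical local table; in practice the entries would come from a
# dictionary source whose terms permit this kind of use.
SYLLABIFIED = {
    "undulate": "un·du·late",
}


def count_from_entry(entry, separator="·"):
    """Count the non-empty pieces of a syllabified headword."""
    return len([piece for piece in entry.split(separator) if piece])


def lookup_syllables(word, table=SYLLABIFIED):
    """Return the syllable count for a known word, or None if it is not listed."""
    entry = table.get(word.lower())
    return None if entry is None else count_from_entry(entry)


print(lookup_syllables("undulate"))     # 3
print(lookup_syllables("blorptastic"))  # None: not in the table
```

Returning `None` for unknown words keeps the gap visible, so a caller can decide whether to fall back to a heuristic for new words and proper nouns.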

Other such dictionaries include Answers.com, The Free Dictionary, Merriam-Webster, and so on. Do read the Terms and Conditions: automated retrieval may not be allowed. Also, different dictionaries don't always agree with each other.

It won't help with new words or proper nouns, but I'd say it's going to be the most accurate method.

About hyphenation

Another related problem has received a lot more exposure: hyphenation. But don't use that! It is used in typesetting programs such as LaTeX, but it only aims to provide some of the correct hyphens without ever providing an incorrect one (high precision, low recall). It's interesting to note that there are only 14 exceptions, e.g. project, which is hyphenated differently depending on the part of speech (verb or noun).

Hyphenation programs

If you decide that it's enough for your needs, note that a few implementations of the TeX hyphenation algorithm exist in other languages, such as Python, Perl or Ruby.

Quentin Pradet
  • You downvoted me for stating that there exists no 100% accurate algorithm, yet the only one you provided is the exhaustive list... – Armen Tsirunyan Feb 01 '12 at 13:54
  • I downvoted you for stating that there was no accurate algorithm, and saying that based only on a few examples. How do you define "accurate"? 100% is definitely not what we aim for in natural language processing, since inter-annotator agreement is never that high. – Quentin Pradet Feb 01 '12 at 13:58
  • I didn't say that based on the examples. I said it based on my claim that I can provide a counterexample to any existing algorithm. The examples were illustrative. – Armen Tsirunyan Feb 01 '12 at 13:59
  • Hmm, I can only cancel my downvote if your answer is edited. I still think it's misleading, but I would cancel the downvote if I could. – Quentin Pradet Feb 01 '12 at 14:01
  • I am not bitching about a downvote. I am conducting a constructive dialogue :) – Armen Tsirunyan Feb 01 '12 at 14:05
16

I'm looking for a fully accurate statement of an algorithm to count syllables in words

There isn't one. Period. Whatever algorithm you invent, I promise to find a counterexample. In certain languages (Armenian and Russian come to mind) the algorithm is pretty straightforward: count the number of vowels. In other languages, such as German, it's not as straightforward but still doable. In English, I am afraid, the correspondence between letters and sounds is absolutely irregular.

For example,

Take coincidence: the oi counts as two syllables, but in boil it is only one. Also, not counting the final vowel is not always accurate: consider the names Penelope and Hermione, or the word banana.

Another curious case is a syllable that exists without a printed vowel. For example, table is a bisyllabic word, but the second syllable is produced by the invisible sound between b and l. Also, don't forget about words of Greek origin, which can have a lot of consecutive vowels, e.g. onomatopoeia.

So, there is no accurate algorithm. The only way you can go is to try to find an algorithm which works in many (I am avoiding the word most) cases. But in this case you should redefine your requirements.
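To make the counterexamples concrete, here is a small sketch that checks any candidate counter against the words mentioned above. The expected counts are the usual dictionary values (pronunciations can vary), the names `EXPECTED`, `count_vowel_letters` and `failures` are made up for this illustration, and plain vowel counting is used only as an example of a rule to test.

```python
# Syllable counts for the words mentioned above, as usually given in dictionaries.
EXPECTED = {
    "coincidence": 4,
    "boil": 1,
    "penelope": 4,
    "hermione": 4,
    "banana": 3,
    "table": 2,
    "onomatopoeia": 6,
}


def count_vowel_letters(word):
    """Count every a, e, i, o, u (the simple vowel-counting rule)."""
    return sum(1 for letter in word.lower() if letter in "aeiou")


def failures(counter, expected=EXPECTED):
    """Return {word: (got, wanted)} for every word the counter gets wrong."""
    return {
        word: (counter(word), wanted)
        for word, wanted in expected.items()
        if counter(word) != wanted
    }


print(failures(count_vowel_letters))
```

With plain vowel counting, the words reported as wrong are coincidence, boil and onomatopoeia. Swapping in a different counter gives a different set of failures, which is the point being made here: a letter-based rule that fixes one word tends to break another.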

anefeletos
Armen Tsirunyan
  • If it helps to know, what I'm using this for is to implement readability formulas. The two that I've selected have a variable that equates to "average number of syllables per word", which means I need to count syllables. What I am noticing, however, is that *some* of my results match the examples in the paper I got these formulas from and some don't. So I'm trying to track down how my results differ from the paper's author's, and this seems like the likely problem since my word counts are accurate. – Glenn1234 Feb 01 '12 at 13:10
  • It's complicated, but solutions exist. – Quentin Pradet Feb 01 '12 at 13:46
1

Old question, but still, people probably read it once in a while and it is an open question.

Words aren't built up out of discrete, well defined, agreed syllables - you try your best to separate language into syllables, and the way you do it depends on the purpose - some are more phonetic, others rely more on spelling.

Phonetic methods produce different results depending on the accent or dialect of the speaker, and/or how clearly each individual is speaking at a particular time. In some phonetic methods, syllables share sounds - i.e. the last sound in one syllable can be the first in the next, and this can cross word boundaries.

What is taught in schools (if the school bothers at all) is often a mixture of spelling and phonetic rules designed to help children spell. They try to have a few memorable rules that work a lot of the time; they aren't meant to be 100% correct or exhaustive.

With any particular method, you'll likely find things that don't sound right to you.

Now the answer: For a readability metric, it won't matter much which method is used. Even just counting the letters (or the vowels) in each word can work. If you are trying to match someone else's results, then you need to know their method.
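As a sketch of what that looks like in practice, the "average number of syllables per word" variable can be computed with the counting method passed in as a parameter, so that different methods can be compared on the same text. Everything here is an assumption for illustration: the function names, the simple letters-only tokenizer, and the default counter, which counts runs of vowels and treats `y` as a vowel.

```python
import re


def vowel_groups(word):
    """Hypothetical stand-in counter: number of runs of consecutive vowels."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


def average_per_word(text, count=vowel_groups):
    """Average of count(word) over the words in text (0.0 if there are none)."""
    words = re.findall(r"[A-Za-z]+", text)  # assumption: letters-only tokens
    if not words:
        return 0.0
    return sum(count(w) for w in words) / len(words)


sample = "The rain fell on the banana."
print(average_per_word(sample))             # vowel-group average
print(average_per_word(sample, count=len))  # letters per word instead
```

Because the counter is a parameter, matching someone else's published numbers comes down to plugging in their counting method once it is known.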

Unanimous
1

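def splitting_into_syllables(input_word):
    """Split a word into rough syllable-like chunks (naive heuristic).

    A chunk ends at each vowel that directly follows a consonant; any
    trailing letters are attached to the last chunk.
    """
    word = input_word.lower()
    vowels = set("aeiou")
    syllables = []
    start = 0
    for index in range(1, len(word)):
        # A vowel preceded by a non-vowel closes the current chunk.
        if word[index] in vowels and word[index - 1] not in vowels:
            syllables.append(word[start:index + 1])
            start = index + 1
    # Attach any leftover letters to the last chunk (or keep them as the only one).
    if start < len(word):
        if syllables:
            syllables[-1] += word[start:]
        else:
            syllables.append(word[start:])
    return syllables


user_input = input()
chunks = splitting_into_syllables(user_input)
print(chunks)
print(len(chunks))  # estimated syllable count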
S P
  • While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – n. m. could be an AI Jun 19 '22 at 17:37
0

What you need is a dictionary that maps the regular spelling of English words to their International Phonetic Alphabet equivalents. The phonetic transcription represents the syllables in a word more accurately than the spelling does, so you can derive a more accurate syllable count from it. However, that doesn't account for variations in pronunciation.

user151841