17

I have a string with some characters, and I'm looking for the organization of those characters such that it's the most pronounceable possible.

For example, if I have the letters "ascrlyo", there are some arrangements that would be more pronounceable than others. The following may get a "high score":

scaroly crasoly

Where as the following may get a low score:

oascrly yrlcsoa

Is there a simple algorithm I can use? Or better yet, a Python functionality that achieves this?

Thank you!

Roee Adler
  • 33,434
  • 32
  • 105
  • 133
  • 2
    You'll need a solid knowledge of English phonetics to do this. It's really not a simple problem at all. – kindall Oct 24 '16 at 01:49
  • @kindall I assumed so, but one can hope someone already solved it in an elegant way... – Roee Adler Oct 24 '16 at 02:05
  • 3
    This looks like it might help: http://stackoverflow.com/a/6298193/4996248 . It describes a Python gibberish detector. Find the permutation that has the smallest gibberish score. – John Coleman Oct 24 '16 at 02:33
  • Apart from the difficulties already pointed out, doesn't the problem exhibit factorial complexity in the worst case (say when all permutations are almost equally pronounceable like a string containing only vowels)? Also, unless there is a specific criteria, pronounciability is subjective to order the permutations in my opinion. – Abhishek Bansal Oct 24 '16 at 03:29
  • 1
    @AbhishekBansal The linked answer defines a possible ranking of permutations. We can use dynamic programming to reduce the complexity to 2^n poly(n) where n is the number of letters, which should be OK on shortish words. – David Eisenstat Oct 24 '16 at 03:55
  • 1
    Have a look at vowel-consonant patterns. Some machine-generated password schemes use patterns like CVCCVCCVC in an attempt to produce random but pronounceable password e.g. https://www.gov.uk/government/publications/password-policy-simplifying-your-approach/password-policy-executive-summary – mcdowella Oct 24 '16 at 04:05

2 Answers2

8

Start by solving a simpler problem: is a given word pronounceable?

Machine learning 'supervised learning' could be effective here. Train a binary classifier on a training set of dictionary words and scrambled words (assume the scrambled words are all unpronounceable). For features, I suggest counting bigrams and trigrams. My reasoning: unpronounceable trigrams such as 'tns' and 'srh' are rare in dictionary words, even though the individual letters are each common.

The idea is that the trained algorithm will learn to classify words with any rare trigrams as unpronounceable, and words with only common trigrams as pronounceable.


Here's an implementation with scikit-learn http://scikit-learn.org/

import random
def scramble(s):
    return "".join(random.sample(s, len(s)))

words = [w.strip() for w in open('/usr/share/dict/words') if w == w.lower()]
scrambled = [scramble(w) for w in words]

X = words+scrambled
y = ['word']*len(words) + ['unpronounceable']*len(scrambled)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB())
    ])

text_clf = text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

from sklearn import metrics
print(metrics.classification_report(y_test, predicted))

It scores 92% accuracy. Given pronounceability is subjective anyway, this might be as good as it gets.

                 precision    recall  f1-score   support

      scrambled       0.93      0.91      0.92     52409
           word       0.92      0.93      0.93     52934

    avg / total       0.92      0.92      0.92    105343

It agrees with your examples:

>>> text_clf.predict("scaroly crasoly oascrly yrlcsoa".split())
['word', 'word', 'unpronounceable', 'unpronounceable']

For the curious, here are 10 scrambled words it classifies pronounceable:

  • moro garapm ocenfir onerixoatteme arckinbo raetomoporyo bheral accrene cchmanie suroatipsheq

And finally 10 dictionary words misclassified as unpronouncable:

  • ilch tohubohu usnea halfpaced pyrostilpnite lynnhaven cruel enure moldproof piecemeal
Colonel Panic
  • 132,665
  • 89
  • 401
  • 465
  • 1
    Good answer. This is a tough problem for English as it is very much the paragon of phonetic nonconformity. – erip Oct 31 '16 at 11:11
  • I really like this approach the problem but clearly the model needs more training as it mis-categorises 60% of the 'unpronounceable' words. 'halfpaced', 'lynnhaven', 'cruel', 'enure', 'moldproof' and 'piecemeal' are all clearly pronounceable. How would one train the model to be more accurate? – jimiclapton Apr 27 '20 at 09:07
0

(For completeness, here's my original pure Python solution that inspired me to try machine learning.)

I agree a reliable solution would require a sophisticated model of the English language, but maybe we can come up with a simple heuristic that's tolerably bad.

I can think of two basic rules satisfied by most pronouncable words:

1. contain a vowel sound
2. no more than two consonant sounds in succession

As a regular expression this can be written c?c?(v+cc?)*v*

Now a simplistic attempt to identify sounds from spelling:

vowels = "a e i o u y".split()
consonants = "b bl br c ch cr chr cl ck d dr f fl g gl gr h j k l ll m n p ph pl pr q r s sc sch sh sl sp st t th thr tr v w wr x y z".split()

Then it's possible to the rules with regular expressions:

v = "({0})".format("|".join(vowels))
c = "({0})".format("|".join(consonants))

import re
pattern = re.compile("^{1}?{1}?({0}+{1}{1}?)*{0}*$".format(v, c))
def test(w):
    return re.search(pattern, w)

def predict(words):
    return ["word" if test(w) else "scrambled" for w in words]

This scores about 74% on the word/scrambled test set.

             precision    recall  f1-score   support

  scrambled       0.90      0.57      0.70     52403
       word       0.69      0.93      0.79     52940

avg / total       0.79      0.75      0.74    105343

A tweaked version scored 80%.

Colonel Panic
  • 132,665
  • 89
  • 401
  • 465