Rhyme Logic using CMU Pronouncing Dictionary
OK. Suppose you want to use CMU Pronouncing Dictionary data (example file: cmudict-0.7b) to build a list of all the words that rhyme with "LOVE".
Here's how you might do it:
First, you need to learn the pronunciation of "LOVE". You'll find this line in the dictionary, where "LOVE" and "L AH1 V" are separated by two spaces:
LOVE L AH1 V
This is saying that the word LOVE
is pronounced like L AH1 V
.
Then, find the vowel phoneme that has primary stress. In other words, look for the number "1" in that pronunciation. The text directly to the left of the 1 is the vowel sound that has primary stress (AH
). That text, and everything to the right of it are your "rhyme phonemes" (for the lack of a better term). So the rhyme phonemes for LOVE are AH1 V
.
We're half done! Now we just have to find other words whose pronunciations end with AH1 V
. If you're playing along in Notepad++, try a Find All In Current Document for pattern AH1 V$
using Search Mode of "Regular expression". This will match lines like:
Line 392: ABOVE AH0 B AH1 V
Line 10266: BELOVE B IH0 L AH1 V
Line 30204: DENEUVE D IH0 N AH1 V
Line 30205: DENEUVE(1) D IY0 N AH1 V
Line 34064: DOVE D AH1 V
Line 48177: GLOVE G L AH1 V
Line 49053: GOV G AH1 V
... etc
Rhyming woooooords!
There are plenty of ways to implement this, and plenty of corner cases, but this is roughly the approach that many electronic rhyming dictionaries appear to take when finding perfect rhymes.
Hypothetical SQL approach to storing rhyme data
Obviously, performance will be a problem if you just scan the dictionary every time someone wants a rhyme. If that's a concern, you might try storing or indexing the data differently.
Although it's not the most efficient on disk space, I've had a good experience storing this stuff in a SQL table with indexed columns.
For a simple conceptual example, you could compute the "rhyme phonemes" of all words in the dictionary, then insert them into a "Rhymes" table whose columns are { WordText, RhymePhonemes }. For example, you might see records like:
{"ABOVE", "AH1 V"}
{"DOVE", "AH1 V"}
{"OUTLIVE", "IH1 V"}
{"GRADUATE", "AE1 JH AH0 W AH0 T"}
{"GRADUATE", "AE1 JH AH0 W EY2 T"}
... etc
Then, to find rhymes, you'd issue a query like:
SELECT OTHER.WordText
FROM Rhymes INPUT
INNER JOIN Rhymes OTHER ON OTHER.RhymePhonemes = INPUT.RhymePhonemes
WHERE INPUT.WordText = 'love' AND
OTHER.WordText <> INPUT.WordText
ORDER BY OTHER.WordText
This also comes in handy if you're planning on printing a dictionary where all similar-sounding words are grouped together.
There are of course plenty of other ways to store/search the data of varying trade-offs, but hopefully this gets you started.
I've also had some luck storing the raw pronunciation in the database in varying "full" formats (forward and reversed strings of the pronunciation, with stress marks and without stress marks, etc) but not "chopped" into specific pieces like a rhyme-phoneme column.
Gotchas
Again, the original explanation with "love" will absolutely get you in the ballpark of rhyming. However, along the way you'll probably run into other gotchas to consider. Here's a heads-up:
- Some words have multiple pronunciations. In the CMU dictionary, the alternate pronunciations are marked with text like
(1)
, (2)
, etc following the word as in GRADUATE(2)
. If someone wants a rhyme of these words, you have to decide between showing rhymes of ALL matched pronunciations, or having the user choose which pronunciation they really meant.
- What do you do when the pronunciation has two or more "1"s? Pick the first one? Pick the last one? If you pick the last one, you'll find more rhymes, but it might not be the most natural choice of stress.
- What do you do when the pronunciation has no "1"s? It doesn't happen a lot, but it happens, like:
ACCREDIT AH0 K R EH2 D AH0 T
and AIKIN EY0 K IH0 N
. In this case I'd pick the next best stress (e.g. pick the 2, 3, 4, etc if the 1 is absent). If they're all 0's, I don't have any good advice.
- Some pronunciations are missing. It's a great start, but it doesn't have all the words or spellings of words you might want. US spelling is preferred over UK spelling.
- Some pronunciations are not what you'd expect, and you may want to prune. For example there's a pronunciation of "or" that sounds like "er".
- You may want to compare the "rhyme phonemes" with stress marks removed. This only matters for words whose primary stress is not on the last vowel (so you don't see the problem on the "love" example).