
I'm attempting to create an algorithm that will suggest Mad Gab style phrases.

The input is a set of phrases. I also have a set of keywords that I'd like to use when possible. Currently, my solution is simply brute force:

  • loop over phrases (character by character)
    • if keyword is found
      • store keyword and branch (recursion)
    • increment character count
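A rough sketch of that branching pass (in Python for brevity, since the logic ports directly to PHP; the phrase and keyword list are hypothetical):

```python
def suggest(phrase, keywords, i=0, path=()):
    """One pass over the phrase; on each keyword match, store it and branch."""
    if i >= len(phrase):
        return [list(path)]
    results = []
    for kw in keywords:
        if phrase.startswith(kw, i):
            # keyword found: store it and branch (recursion)
            results += suggest(phrase, keywords, i + len(kw), path + (kw,))
    if not results:
        # no keyword at this position: increment character count
        results = suggest(phrase, keywords, i + 1, path)
    return results
```

Because every match branches, this also surfaces compound splits (e.g. both "catches" and "cat" + "ches" come back as separate suggestions).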

However, the problems I am running into are:

  • Account for compound keywords, e.g. "catches" can be "catches", "cat" + "cheeses"
  • Allow literal terms - "the", "and", "one", "two", "three".
  • How to suggest terms that are not keywords, i.e. falling back on something like the system dictionary when keywords or literals cannot be found.
  • Skip phrase segments. Right now it just does one pass through. But consider the case where the phrase starts with something unmatched but a few characters later contains matches.

I am most familiar with PHP and MySQL. However, I am open to another technology if it provides a better solution.

I am also interested in any additional suggestions. Particularly ways to use the second parameter of metaphone() to make harder suggestions.

Jason McCreary
  • The puzzles I looked at were all pretty heavily dependent on 1) phonetic pronunciation and 2) knowledge of common English phrases and names of famous people. I'm not sure they can be solved without a lot more knowledge than you're working with. –  Mar 19 '12 at 23:05
  • Phonetic pronunciation = `soundex()`, right? I'm not interested in using famous people in the phrases. As noted, I'd rather use my keywords. I will indeed need more words in my word bank. Hence the third bullet point. My *solution* is incomplete. I'm very open to suggestion. – Jason McCreary Mar 20 '12 at 14:07
  • My mistake - I meant `metaphone()` for pronunciation. – Jason McCreary Mar 20 '12 at 14:13
  • Very interesting problem. You may be responsible for my getting behind my deadline at work. – Jason Mar 23 '12 at 21:19
  • @Jason, I believe so as well and hoped the bounty would help get some attention. – Jason McCreary Mar 23 '12 at 21:26
  • Maybe this can help you http://stackoverflow.com/questions/369755/how-do-i-do-a-fuzzy-match-of-company-names-in-mysql-with-php-for-auto-complete – georgepsarakis Mar 25 '12 at 09:29
  • Should the final result be selected manually by a person given a list of options? I am kind of thinking that it would be better if the gab were built interactively. – fie Mar 26 '12 at 08:32
  • @fie, yes. I am glad to select and rate the results. I'm only expecting the algorithm to provide suggestions - as complete as possible (i.e. not just a few keywords). – Jason McCreary Mar 26 '12 at 13:28
  • @JasonMcCreary While I personally would _love_ to see what others come up with, a roommate and I were throwing around a spec algorithm for this yesterday featuring a simple [Markov model](https://en.wikipedia.org/wiki/Markov_model) with scoring by lexical complexity (that is, length of each longest matched, known homophone). I'm considering giving it a full writeup. Interested? – MrGomez Mar 26 '12 at 21:39
  • @MrGomez, definitely. Provide an answer. Bounty is good for the next four days. – Jason McCreary Mar 26 '12 at 22:15
  • I'm working on one as well; I may be able to post it tonight or tomorrow. – Jason Mar 27 '12 at 14:38
  • Yeah, I made some progress. Not functional though. – fie Mar 27 '12 at 19:39
  • I got swamped this week and couldn't finish my response in time. That said, the two that were posted were probably better (and far more sophisticated) than what I've got anyway. I'll still plan on completing and posting my solution anyway, when I get a chance, as I still find this so interesting. – Jason Mar 31 '12 at 00:18

2 Answers

Perhaps start with a syllable division algorithm on the phrase bank. You can even use a simple resource that teaches children to divide syllables to create your rough divider method:

http://www.ewsdonline.org/education/components/scrapbook/default.php?sectiondetailid=7584

If you want a more technical, completely accurate way, there was a Ph.D. dissertation about how to do it:

http://www.tug.org/docs/liang/

Then turn each syllable into a phonetic representation using either something you roll yourself or metaphone(). You can use a similar site that explains vowel sound rules. These will only be generalizations. You will process vowels separately from consonants if you roll your own. Metaphone just uses consonants, which is fine, but not as cool as if you also took into account vowels.

Vowels: http://www.eslgold.com/pronunciation/english_vowel_sounds.html
Consonants: http://usefulenglish.ru/phonetics/english-consonant-sounds

Then, you have a dictionary of English words for your word bank. There are many open-source dictionaries available that you could stick into a MySQL table.

Start with the first syllable and look for a random word in the dictionary that passes the soundex test. If you can't find one (this step alone will generally only find one-syllable words), add the next syllable and search again.
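That accumulate-and-search loop might look like this (a Python sketch; the `soundex()` here is a simplified American Soundex standing in for PHP/MySQL's `soundex()`, and the dictionary list is hypothetical):

```python
import random

def soundex(word):
    """Simplified American Soundex (rough stand-in for PHP/MySQL soundex())."""
    codes = {c: str(d) for d, group in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], 1) for c in group}
    word = word.lower()
    out, prev = [word[0].upper()], codes.get(word[0])
    for c in word[1:]:
        d = codes.get(c)
        if d and d != prev:
            out.append(d)
        if c not in "hw":           # h and w do not reset the previous code
            prev = d
    return ("".join(out) + "000")[:4]

def match_syllables(syllables, dictionary):
    """Accumulate syllables until some dictionary word passes the soundex test."""
    matches, chunk = [], ""
    for syl in syllables:
        chunk += syl
        hits = [w for w in dictionary if soundex(w) == soundex(chunk)]
        if hits:
            matches.append(random.choice(hits))  # random sound-alike word
            chunk = ""                           # start a fresh chunk
    return matches
```

For example, `match_syllables(["lah", "gee", "cahl"], dictionary)` with a dictionary containing "law", "gee", and "call" would reproduce the "Law Gee Call" portion of the example above.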

Example:

"Logical consequence"

A. Syllable split

"lo gi cal con se quence"

B. Vowel Sounds applied

"lah gee cahl con see quince"

C. Consonant Sounds applied

"lah jee kahl kon see quinse"

D. Soundex test (one-syllable soundex - obviously too easy to guess, but it proves the concept)

"Law Gee Call Con Sea Quints"

strcmp() on two soundex values returns a number. So, if you like, you could compute the soundex values of everything in your word bank in advance. Then you can run the strcmp quickly.

An example of a Soundex MySQL comparison is:

select strcmp(soundex('lah'), soundex('law'));

I think using the MySQL soundex is easier for you than the PHP soundex test if you want a random result from a big database and you've already captured the soundex value in a field of your dictionary table.

My suggestion may be inefficient, but optimization is a different question.

Update:

I didn't mean to imply that my solution would only yield one-syllable words. I used one syllable as the example, but if you took two of the syllables together, you'd get multi-syllable matches. In fact, you could probably just start by jamming all the syllables together and running soundex in MySQL. If you find an answer, great. But then you can roll off syllables until you get the longest match you can. Then you're left with the end of the phrase and can take those syllables together and run a match. I think that's the essence of the solution below from the other contributor, but I think you need to avoid jamming all the letters together without spaces.

In English, you'd lose information that way. Think of a phrase beginning with a "th" sound. If you jam the phrase together, you lose which "th" sound is needed. "Theremin" (the instrument) has a different "th" sound than "There, a man".
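The jam-and-roll-off idea might be sketched like this (Python; `sound_match` is a hypothetical stand-in for the MySQL soundex lookup, returning a sound-alike word or None):

```python
def longest_matches(syllables, sound_match):
    """Jam syllables together, then roll syllables off until a match is found."""
    results, i = [], 0
    while i < len(syllables):
        # try the longest run of syllables first, shrinking toward a single one
        for j in range(len(syllables), i, -1):
            word = sound_match("".join(syllables[i : j]))
            if word:
                results.append(word)
                i = j          # continue with the remainder of the phrase
                break
        else:
            i += 1             # nothing matched even one syllable: skip it
    return results
```

This greedily prefers the longest sound-alike run, then repeats on whatever remains of the phrase.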

Jonathan Barlow
  • +1, novel approach, excellent partitioning of the problem space. I think this answer can be refined, but it sketches out a usable solution to the OP's question. – MrGomez Mar 28 '12 at 03:38
  • Thank you - I hope it is helpful. Sounds like a fun project. – Jonathan Barlow Mar 28 '12 at 04:19
  • I will review your answer today. At first glance though, why `soundex()` and not `metaphone()`. – Jason McCreary Mar 28 '12 at 12:54
  • Hi Jason - I was advising using the MySQL soundex, and there's not a native MySQL metaphone. – Jonathan Barlow Mar 28 '12 at 13:49
  • From my findings, `metaphone()` is more accurate than `soundex()`. However, I suppose at the end of the day I could use either interchangeably. – Jason McCreary Mar 30 '12 at 12:16
  • I'm awarding you the bounty for addressing the fundamental algorithm. Furthermore, I believe your approach at splitting syllables will ultimately yield a better result. I will also mark this as the answer once I write the code - hopefully in the next week. – Jason McCreary Mar 30 '12 at 12:22
  • Thanks, Jason. If you run into any snags, feel free to add them here and I'll do my best to watch for comments. – Jonathan Barlow Mar 30 '12 at 12:23

Taking a different tack from Jonathan Barlow's solution, I recommend an O(n²) algorithm that gives you the properties you seek, in randomness, robustness, and scalable difficulty. The complexity of this algorithm can be further improved in constant time or with optimizations to the modality of the search, but because the size of your input phrases is guaranteed to be small, it's not that big a deal.

  1. Construct a hash table of all known words in the Oxford English Dictionary and a map of lists of words by soundex() value. This initially sounds intractable, until you realize that there aren't actually that many of them in current use. Assuming a decent one-way hashing algorithm, this should take several megabytes, tops.

  2. Consider the words in your input phrase as a single, compressed string of characters with no word identity whatsoever, discarding whitespace and all punctuation. From this, walk the space for all character lengths, starting with a length of one, up to the full length of the amalgamated phrase minus one. For each string produced by this walk, perform a hash lookup against OED. When a word is encountered that's present in the dictionary, append its word and position to the end of a list in memory.

    (This pass will always take time proportional to the sum 1 + 2 + … + n, which is by definition 0.5n(n+1). So, O(n²) it is. Its space complexity is worst-case O(n²), but in practice, a fully connected set of terms is extremely unlikely.)

  3. Now comes your difficulty slider. From the produced list, chop off the first N% of the found terms, where N is your level of difficulty. The principle here is, smaller words are easier for someone to lexically process, while longer words are more difficult to sound out and differentiate.

  4. Construct an array conforming to the original length of the phrase (without spaces and punctuation) and shuffle your list of encountered words. Now, walk the shuffled list. For each element, verify if all of the slots in the array are free for that word at its original position. If they are, keep the word and its position, marking the slots as used in the array. If they are not, iterate to the next word until the list is exhausted.*

  5. From the final output array, construct a partitioned list of unused characters in the space, treating each bag of characters as its own phrase. For this list, perform syllable detection exactly as sketched out here, passing the results to metaphone() with a percentage chance of glomming two or more syllables together. Then, for the bag of output dictionary words from 4., perform soundex(), pulling a random word from the word's mapped list of comparable soundex values. For every word that can only soundex() to itself according to the backing map of lists, perform partitioning and metaphone(). Finally, stitch the two lists of results together by sorting on position and print your result.
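For illustration, steps 2 and 4 might be sketched like this (Python; the dictionary set is a tiny stand-in for the hashed OED, and names here are mine, not from the answer):

```python
import random

def find_words(phrase, dictionary):
    """Step 2: walk all substrings of the squashed phrase, O(n^2) lookups."""
    s = "".join(c for c in phrase.lower() if c.isalpha())
    hits = []
    for length in range(1, len(s)):              # lengths 1 .. n-1
        for i in range(len(s) - length + 1):
            if s[i : i + length] in dictionary:  # hash lookup against the OED
                hits.append((i, s[i : i + length]))
    return s, hits

def place_words(s, hits):
    """Step 4: shuffle the hits and keep words whose slots are still free."""
    random.shuffle(hits)
    used = [False] * len(s)
    kept = []
    for pos, word in hits:
        if not any(used[pos : pos + len(word)]):
            kept.append((pos, word))             # claim the slots
            for k in range(pos, pos + len(word)):
                used[k] = True
    return sorted(kept)                          # order by original position
```

Overlapping candidates (say, "log" and "logic" at the same position) can never both survive step 4, since the first one drawn claims the shared slots.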

This is a random algorithm with what I believe to be all of the desired properties, but it's still rough in my mind.


 * Extra credit: determine the allowed overlaps for your system by character or syllable. This can make for an even larger gamut of accepted output phrases and a much higher level of difficulty.

MrGomez
  • I like a lot of this. But I'm curious about steps one and two. If the hash algorithm itself doesn't allow for some kind of soundex comparison, then I'm not sure why, after step 2, we would not just have a list of the words that are already in the phrase. So, after "Logical consequence" goes through one and two, without soundex, it seems like we'd have log, logic, logical, og, gi, calc, con, cons, on, seq, sequence, que. I don't know how that list of words, then, would help us get to the goal of forming phonetic phrases. But perhaps you intended a hash on the soundex of the word? – Jonathan Barlow Mar 28 '12 at 05:49
  • @JonathanBarlow Just so. We would still need a dictionary of terms with comparable `soundex()` values (I should make this more clear), but this will allow us to upgrade from syllables to complete words _that are bound similarly_ to phrases in English. Option B is to just use your approach, stapling it to a [Markov model](http://en.wikipedia.org/wiki/Markov_model) so we get interesting bindings to the terms. Oh, this is such a fun problem. :) – MrGomez Mar 28 '12 at 06:42
  • I will review your answer today. – Jason McCreary Mar 28 '12 at 12:55
  • @JasonMcCreary One obvious improvement I forgot to include: for each word that can only `soundex` to itself, perform partitioning and `metaphone`. I'll update my answer with this. – MrGomez Mar 28 '12 at 17:14
  • From fiddling with my iPhone this evening, I discovered the best algorithm by far: simply ask Siri to schedule an appointment. It thinks I say the _darnedest_ things when I talk technical. – MrGomez Mar 29 '12 at 09:30
  • Your fundamental approach isn't that much different from what I have currently. With that said, I have awarded the bounty to **Jonathan Barlow**. Nonetheless, your difficulty algorithm is awesome and above and beyond my initial question. I imagine I will explore an implementation down the road. I wish I could split the bounty or upvote more than once, as I appreciate the answer. – Jason McCreary Mar 30 '12 at 12:19
  • @JasonMcCreary Hey, no problem at all! I just wanted to further explore what I feel is the most robust approach to the problem, with qualities of verisimilitude and randomness I feel are pleasing. For what it's worth, though, SO *does* provide a mechanism for this: see **Reward existing answer** [here](http://blog.stackoverflow.com/2011/09/bounty-reasons-and-post-notices/). It's below me to ask for additional rep, but given your intent, I'm not going to complain if you throw any my way. Thank you for your kind replies and the awesome puzzle, in any event. :) – MrGomez Mar 30 '12 at 16:13