I'll try to answer my own question based on Broken Link's comment (thank you for this):
You've extracted phrases consisting of 1 to 3 words from your database of documents. Among these extracted phrases are the following:
- Half Blood Prince
- Half-Blood Prince
- Halfblood Prince
For each phrase, you strip everything that is not a letter (special characters, blank spaces, and also digits) and make the string lowercase:
$phrase = 'Half Blood Prince';
$phrase = preg_replace('/[^a-z]/i', '', $phrase);
$phrase = strtolower($phrase);
// result is "halfbloodprince"
When you've done this, all 3 phrases (see above) have one spelling in common:
- Half Blood Prince => halfbloodprince
- Half-Blood Prince => halfbloodprince
- Halfblood Prince => halfbloodprince
So "halfbloodprince" is the parent phrase. You insert both the original phrase and its parent phrase into your database.
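As a minimal sketch, the normalization step can be wrapped in a function and used to group the spellings. The helper name `toParentPhrase()` is my own assumption for illustration; it only applies the same `preg_replace` and `strtolower` calls shown above.

```php
<?php
// Hypothetical helper (the name is an assumption): reduce a phrase to its
// parent phrase by dropping every non-letter and lowercasing the rest.
function toParentPhrase($phrase) {
    $lettersOnly = preg_replace('/[^a-z]/i', '', $phrase);
    return strtolower($lettersOnly);
}

// Group the example spellings by their parent phrase.
$variants = array('Half Blood Prince', 'Half-Blood Prince', 'Halfblood Prince');
$grouped = array();
foreach ($variants as $variant) {
    $grouped[toParentPhrase($variant)][] = $variant;
}
// $grouped now has a single key, "halfbloodprince", holding all three spellings
```

Each pair (original phrase, `toParentPhrase()` result) would then be one row in the `phrases` table.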
To show a "Trending Topics Admin" like Twitter's you do the following:
// first select the top 10 parent phrases
// (the mysql_* functions were removed in PHP 7; with mysqli or PDO the queries stay the same)
$sql1 = "SELECT parentPhrase, COUNT(*) AS cnt FROM phrases GROUP BY parentPhrase ORDER BY cnt DESC LIMIT 0, 10";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $parentPhrase = $sql3['parentPhrase'];
    $childPhrases = array(); // set up an array for the child phrases
    $fifthPart = round($sql3['cnt'] * 0.2);
    // now select all child phrases which make up 20% or more of the parent phrase's count
    $sql4 = "SELECT phrase FROM phrases WHERE parentPhrase = '".mysql_real_escape_string($parentPhrase)."' GROUP BY phrase HAVING COUNT(*) >= ".$fifthPart;
    $sql5 = mysql_query($sql4);
    while ($sql6 = mysql_fetch_assoc($sql5)) {
        $childPhrases[] = $sql6['phrase']; // read from the inner result row, not $sql3
    }
    // now you have the parent phrase which is on the left side of the arrow in $parentPhrase
    // and all child phrases which are on the right side of the arrow in $childPhrases
}
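To check the selection logic without a database, here is a rough in-memory sketch of the same idea. It assumes the rows of the `phrases` table are available as a PHP array of `phrase`/`parentPhrase` pairs. The function name `trendingTopics()` and its parameters are made up for illustration, and plain array counting stands in for the two SQL queries.

```php
<?php
// Hypothetical in-memory stand-in for the two queries above.
// $rows: list of array('phrase' => ..., 'parentPhrase' => ...).
function trendingTopics(array $rows, $limit = 10, $share = 0.2) {
    $parentCounts = array(); // parentPhrase => number of rows
    $childCounts = array();  // parentPhrase => (phrase => number of rows)
    foreach ($rows as $row) {
        $parent = $row['parentPhrase'];
        $phrase = $row['phrase'];
        if (!isset($parentCounts[$parent])) {
            $parentCounts[$parent] = 0;
            $childCounts[$parent] = array();
        }
        if (!isset($childCounts[$parent][$phrase])) {
            $childCounts[$parent][$phrase] = 0;
        }
        $parentCounts[$parent]++;
        $childCounts[$parent][$phrase]++;
    }

    arsort($parentCounts); // most frequent parent phrases first
    $topics = array();
    foreach (array_slice($parentCounts, 0, $limit, true) as $parent => $cnt) {
        $threshold = round($cnt * $share); // same cut-off as $fifthPart
        $children = array();
        foreach ($childCounts[$parent] as $phrase => $phraseCnt) {
            if ($phraseCnt >= $threshold) {
                $children[] = $phrase;
            }
        }
        $topics[$parent] = $children;
    }
    return $topics;
}

// Made-up sample data: 6 + 3 + 1 rows for one parent phrase, 2 rows for another.
$rows = array_merge(
    array_fill(0, 6, array('phrase' => 'Half Blood Prince', 'parentPhrase' => 'halfbloodprince')),
    array_fill(0, 3, array('phrase' => 'Half-Blood Prince', 'parentPhrase' => 'halfbloodprince')),
    array_fill(0, 1, array('phrase' => 'Halfblood Prince', 'parentPhrase' => 'halfbloodprince')),
    array_fill(0, 2, array('phrase' => 'Deathly Hallows', 'parentPhrase' => 'deathlyhallows'))
);
$topics = trendingTopics($rows);
```

With this sample data the parent phrase "halfbloodprince" has 10 rows, so the threshold is 2: "Half Blood Prince" and "Half-Blood Prince" are kept as child phrases, while "Halfblood Prince" (1 row) is dropped.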
Is this what you thought of, Broken Link? Would this work?