2

I am trying to create a new string from multiple strings using the most common words between these strings. For example:

$string[0] = 'Apple iPhone 4S 16GB Locked to Orange';
$string[1] = 'iPhone 4S 16GB boxed new';
$string[2] = 'iPhone 4S 16GB unlocked brand new';
$string[3] = 'Apple iPhone 16GB 4S Special Offer';
$string[4] = 'Apple iPhone 4S Sim Free';

The new string should be:

$new_string = 'Apple iPhone 4S 16GB';

There could be hundreds of original strings, or just 2...

I don't have a clue where to start with this; any help would be really appreciated.

j0k
superphonic
  • Apple appears 3 times, iPhone appears 5 times, 4S appears 5 times, 16GB appears 4 times. What do you mean by "the most common words"? – Alain Tiemblo Oct 13 '12 at 19:35
  • I'm just as clueless as you on this but wouldn't a good start be to chop up the strings so you can perform queries on individual words on the strings? – Simon Carlson Oct 13 '12 at 19:35
  • I don't think you've thought this through thoroughly, but just so you get some practice: use `preg_split('/\b+/m',$aString)` and `array_intersect` to determine what words occur in other strings... – Elias Van Ootegem Oct 13 '12 at 19:37
  • @Ninsuo: Good question. I guess the word count that determines whether a word goes into the new string would need to be set somehow based on the number of original strings. I think you can see what I am trying to accomplish though: if a human were given the task of writing a common, shorter title from the strings above, they would come up with the $new_string I mention above. – superphonic Oct 13 '12 at 19:40

4 Answers

3

You can try

$string = array();
$string[0] = 'Apple iPhone 4S 16GB Locked to Orange';
$string[1] = 'iPhone 4S 16GB boxed new';
$string[2] = 'iPhone 4S 16GB unlocked brand new';
$string[3] = 'Apple iPhone 16GB 4S Special Offer';
$string[4] = 'Apple iPhone 4S Sim Free';

print(getCommon($string));

Output

Apple iPhone 4S 16GB

Function Used

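function getCommon($array, $occurance = 3)
{
    // Flatten every string into one list of words.
    $words = array_reduce($array, function ($a, $b) { return array_merge($a, explode(" ", $b)); }, array());
    // Keep words that occur at least $occurance times, in order of first appearance.
    $common = array_filter(array_count_values($words), function ($var) use ($occurance) { return $var >= $occurance; });
    return implode(" ", array_keys($common));
}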


Baba
  • I think this is absolutely spot on and exactly what I needed. I am going to accept this answer. Thanks to PeeHaa though; your answer gave me the word counts but wasn't a complete working answer. @Baba, would you mind explaining what is going on in your code? I like to understand as much as I like to cut & paste :) Also, do you have suggestions for how best to determine the $occurance value to use? Thanks again. – superphonic Oct 13 '12 at 19:54
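For readers with the same two questions, here is a step-by-step sketch of the same pipeline with the threshold derived from the number of input strings. The function name `commonWords` and the "more than half as many occurrences as there are strings" rule are assumptions chosen for illustration, not something stated in the answer above; `intdiv()` needs PHP 7 or later.

```php
<?php
// Hypothetical helper: keep words whose total count exceeds half the number of strings.
function commonWords(array $strings): string
{
    // Step 1: split every string on spaces and collect the words in one flat list.
    $words = array();
    foreach ($strings as $s) {
        $words = array_merge($words, explode(' ', $s));
    }

    // Step 2: count each word; keys stay in order of first appearance.
    $counts = array_count_values($words);

    // Step 3 (assumed rule): 5 strings give a threshold of 3, 2 strings give 2.
    $threshold = intdiv(count($strings), 2) + 1;

    // Step 4: drop words below the threshold and join what is left.
    $common = array_filter($counts, function ($count) use ($threshold) {
        return $count >= $threshold;
    });

    return implode(' ', array_keys($common));
}
```

With the five example titles from the question this returns `Apple iPhone 4S 16GB`. A different cut-off (for example a percentage of the strings) may suit other data better, so treat the rule as a starting point.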
1

Something like the following should get you started:

function getWordCount($someArray)
{
    $wordList = array();
    foreach($someArray as $item) {
        $wordList = array_merge($wordList, explode(' ', $item));
    }

    $result = array_count_values($wordList);
    arsort($result);

    return $result;
}

Note that I split on the space character, so punctuation such as . or , is not handled and stays attached to the neighbouring word. If you need to account for that, use a simple regex pattern to extract the words according to your requirements.
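As a rough illustration of the regex approach, the sketch below splits on any run of characters that are neither letters nor digits, so trailing commas and full stops are dropped while tokens such as `4S` and `16GB` survive. The helper names `splitWords` and `getWordCountIgnoringPunctuation` are hypothetical, and the pattern is only one possible choice: it also splits hyphenated words and treats `iPhone` and `iphone` as different words.

```php
<?php
// Hypothetical helper: split on runs of anything that is not a letter or a digit.
function splitWords(string $text): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Same counting logic as getWordCount(), but using the regex-based splitter.
function getWordCountIgnoringPunctuation(array $someArray): array
{
    $wordList = array();
    foreach ($someArray as $item) {
        $wordList = array_merge($wordList, splitWords($item));
    }

    $result = array_count_values($wordList);
    arsort($result); // highest counts first

    return $result;
}
```

For example, `splitWords('Apple iPhone 4S, 16GB.')` yields the four words `Apple`, `iPhone`, `4S` and `16GB`.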

Demo: http://codepad.viper-7.com/IuAc2s

PeeHaa
1

Another way to do it

$min_times_present = 3;
$words  = array();
foreach ($string as $str) {
  $words_string = preg_split('/\s+/', $str, 0, PREG_SPLIT_NO_EMPTY);
  foreach ($words_string as $word) {
    $words[$word] = (isset($words[$word])) ? $words[$word]+1 : 1;
  }
}
$result_arr = array_filter($words, function($value) use ($min_times_present) {
  return ($value >= $min_times_present);
});
arsort($result_arr, SORT_NUMERIC);
$result_str = implode(' ', array_keys($result_arr));
air4x
  • Thanks for this, is there any advantage to doing it this way? – superphonic Oct 13 '12 at 20:11
  • @superphonic Compared to the others, I don't think so. But you could benchmark the different methods posted here, using different combinations of strings and large quantities of strings. Use the one which is more robust. – air4x Oct 13 '12 at 20:16
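A minimal sketch of such a benchmark is shown below. The helper `timeIt`, the synthetic titles and the run count are all assumptions made for illustration; the callable passed in stands for whichever of the posted approaches is being measured, and timings will vary by machine and PHP version.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call of $fn over $runs runs.
function timeIt(callable $fn, array $input, int $runs = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($input);
    }
    return (microtime(true) - $start) / $runs;
}

// Synthetic input: 1000 made-up titles sharing a common prefix.
$titles = array();
for ($i = 0; $i < 1000; $i++) {
    $titles[] = 'Apple iPhone 4S 16GB item ' . $i;
}

// Placeholder for the approach under test: count all space-separated words.
$seconds = timeIt(function (array $strings) {
    return array_count_values(explode(' ', implode(' ', $strings)));
}, $titles);

printf("%.6f seconds per run\n", $seconds);
```

Speed is only one criterion; checking each approach against awkward input (punctuation, repeated words, mixed case) says more about robustness than timing does.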
0

I had a similar issue. My solution was to merge all the phrases into one array of words, then take the words with the highest number of occurrences:

$string = array();
$string[0] = 'Apple iPhone 4S 16GB Locked to Orange';
$string[1] = 'iPhone 4S 16GB boxed new';
$string[2] = 'iPhone 4S 16GB unlocked brand new';
$string[3] = 'Apple iPhone 16GB 4S Special Offer';
$string[4] = 'Apple iPhone 4S Sim Free';
$words=array();
for($i=0;$i<count($string);$i++){
    $words = array_merge($words,str_word_count($string[$i],1));
}

$instances = array_count_values($words);
arsort($instances);
$instances = array_slice($instances,0,5);
foreach($instances as $word=>$count){
    echo $word.' ';
}
// Outputs "iPhone S GB Apple new" (str_word_count() ignores digits, so "4S" becomes "S" and "16GB" becomes "GB")

The problem with this method is that if a word appears several times in the same string, each appearance adds to its number of occurrences.
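One way around that, sketched below under the assumption that what matters is how many strings contain a word rather than how often it appears overall, is to de-duplicate the words of each string before counting. The function name `countStringsContaining` is hypothetical; splitting on whitespace with `preg_split()` instead of `str_word_count()` also keeps tokens such as `4S` and `16GB` intact. The `??` operator needs PHP 7 or later.

```php
<?php
// Hypothetical helper: for each word, the number of strings it appears in.
function countStringsContaining(array $strings): array
{
    $counts = array();
    foreach ($strings as $s) {
        // Split on whitespace so digits stay part of the word.
        $words = preg_split('/\s+/', $s, -1, PREG_SPLIT_NO_EMPTY);
        // array_unique() makes a word count at most once per string.
        foreach (array_unique($words) as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts); // most widely shared words first
    return $counts;
}
```

For the input `array('new new new iPhone', 'iPhone 4S')`, `iPhone` is counted twice and `new` only once, even though `new` occurs three times in the first string.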

André