I'm trying to solve the following problem: given an array of sentences/news titles, I want to find those that are very similar (sharing some 3 or 4 words) and put them into a new array. So, for this original array/list:

'Title1: Hackers expose trove of snagged Snapchat images',
'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
'Title3: Family says goodbye at funeral for 16-year-old',
'Title4: New Jersey officials talk about Ebola quarantine',
'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
'Title6: Hackers expose Snapchat images'

The result should be:

Array
(
    [0] => Title1: Hackers expose trove of snagged Snapchat images
    [1] => Array
        (
            [duplicate] => Title6: Hackers expose Snapchat images
        )

    [2] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [3] => Array
        (
            [duplicate] => Title4: New Jersey officials talk about Ebola quarantine
        )
    [4] => Title3: Family says goodbye at funeral for 16-year-old
    [5] => Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands
)

This is my code:

$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
);
$z = 1;
foreach ($titles as $feed)
{
    $feed_A = explode(' ', $feed);
    for ($i=$z; $i<count($titles); $i++)
    {
        $feed_B = explode(' ', $titles[$i]);
        $intersect_A_B = array_intersect($feed_A, $feed_B);
        if(count($intersect_A_B)>3)
        {
            $titluri[] = $feed;
            $titluri[]['duplicate'] = $titles[$i]; 
        }
        else 
        {
            $titluri[] = $feed;
        }
    }
    $z++;
}

It outputs this [awkward, but somewhat close to the desired] result:

Array
(
    [0] => Title1: Hackers expose trove of snagged Snapchat images
    [1] => Title1: Hackers expose trove of snagged Snapchat images
    [2] => Title1: Hackers expose trove of snagged Snapchat images
    [3] => Title1: Hackers expose trove of snagged Snapchat images
    [4] => Title1: Hackers expose trove of snagged Snapchat images
    [5] => Array
        (
            [duplicate] => Title6: Hackers expose Snapchat images
        )

    [6] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [7] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [8] => Array
        (
            [duplicate] => Title4: New Jersey officials talk about Ebola quarantine
        )

    [9] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [10] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [11] => Title3: Family says goodbye at funeral for 16-year-old
    [12] => Title3: Family says goodbye at funeral for 16-year-old
    [13] => Title3: Family says goodbye at funeral for 16-year-old
    [14] => Title4: New Jersey officials talk about Ebola quarantine
    [15] => Title4: New Jersey officials talk about Ebola quarantine
    [16] => Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands
)

Any suggestions would be much appreciated!

VA13
  • Here are some useful links that could help you: [Highlight the difference between two strings in PHP](http://stackoverflow.com/questions/321294/highlight-the-difference-between-two-strings-in-php). You could also take a look at the `similar_text` function in the [PHP manual](http://php.net/manual/en/function.similar-text.php). – Crisoforo Gaspar Oct 12 '14 at 20:14
  • Although it is very dirty, you could use `array_unique` on `$titluri` after your loop to get the expected array? – Alban Pommeret Oct 12 '14 at 20:18
  • @AlbanPommeret, array_unique will not work, already tried it. – VA13 Oct 12 '14 at 20:44
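
For reference, here is a minimal sketch of how the `similar_text` function mentioned in the comments could be used to compare two titles. The 60% cut-off and the variable names are assumptions made for illustration; nothing in the thread prescribes a particular threshold.

```php
<?php
// Two of the titles from the question, without the "TitleN:" prefix.
$a = 'Hackers expose trove of snagged Snapchat images';
$b = 'Hackers expose Snapchat images';

// similar_text() returns the number of matching characters and, through the
// optional third argument (passed by reference), the similarity in percent.
$matchingChars = similar_text($a, $b, $percent);

$threshold = 60.0; // hypothetical cut-off, tune it for your own data
$isSimilar = ($percent >= $threshold);

echo $isSimilar ? "similar\n" : "different\n";
```

Note that `similar_text` works on characters rather than whole words, so it measures something slightly different from the word-intersection approach in the question.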

2 Answers

1

Here is my solution, inspired by @DomWeldon's, with no duplicates:

<?php
$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
);
$titluri    =   array(); // unless it's declared elsewhere
$duplicateTitles = array();
// loop through each line of the array
foreach ($titles as $key => $originalFeed)
{
    if(!in_array($key, $duplicateTitles)){
        $titluri[] = $originalFeed; // all feeds are listed in the new array
        $feed_A = explode(' ', $originalFeed);
        foreach ($titles as $newKey => $comparisonFeed)
        {
            // iterate through the array again and see if they intersect
            if ($key != $newKey) { // but don't compare a line against itself
                $feed_B = explode(' ', $comparisonFeed);
                $intersect_A_B = array_intersect($feed_A, $feed_B);
                // do they share more than three words?
                if(count($intersect_A_B)>3)
                {
                    // yes, add a duplicate entry and remember its key
                    $titluri[]['duplicate'] = $comparisonFeed;
                    $duplicateTitles[] = $newKey;
                }
            }
        }
    }
}
Alban Pommeret
  • Thank you, it seems to do the job very well. I will see if I can tweak it to integrate the code in a larger scheme. Good luck, Alban! – VA13 Oct 12 '14 at 20:57
  • this is another work-around (`in_array` will perform some internal loop anyway). It is of course better than Dom Weldon's solution, but I think we can use 2 for-loops (instead of 2 foreach) and then the performance is even better. The first loop: `$i` from `0` to `< count($titles)-1`, the second loop: `$j` from `$i+1` to `< count($titles)`. However, we may need more tweaks to make it work (not simply changing the loops). – King King Oct 12 '14 at 20:58
  • @KingKing, I have used 2 for-loops, just like you say, but in the old non-functional code '$i' and '$j=$i+1'. Will try to use it on Alban's solution, but tomorrow morning. Thanks for the tips! – VA13 Oct 12 '14 at 21:04
  • @VladAndrei just scan it again, using for-loop still requires you to check if some index has already been taken (as duplicate), however you should use a dedicated array to save those indices (instead of using `in_array` to check), that will perform better (because the searching is based on key, not on value) with a trade-off of more memory needed - but this is little. – King King Oct 12 '14 at 21:27
  • @VladAndrei here is the code I edited using 2 fors: `$dup = array(); for ($i=0;$i < count($titles)-1; $i++) { if($dup[$i]) continue; $titluri[] = $titles[$i]; $feed_A = explode(' ', $titles[$i]); for ($j=$i+1; $j3) { $titluri[]['duplicate'] = $titles[$j]; $dup[$j] = true; } } }` – King King Oct 12 '14 at 21:27
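
The inline code in the last comment seems to have lost part of its inner loop, so here is a sketch of the two-for-loop idea as described in the comments above: compare each pair only once, and track already-claimed titles in a dedicated array keyed by index. The function name `groupSimilarTitles` and the `$minShared` parameter are hypothetical names introduced for illustration, and the outer loop runs over every index (not `count($titles)-1`) so that a final non-duplicate title is still listed. Treat it as one possible reading of the suggestion, not the commenter's exact code.

```php
<?php
// Hypothetical helper: groups titles that share at least $minShared words.
// $minShared = 4 mirrors the "count(...) > 3" check used in the answers.
function groupSimilarTitles(array $titles, $minShared = 4)
{
    $titles = array_values($titles); // make sure the keys are 0..n-1
    $count  = count($titles);
    $result = array();
    $dup    = array();               // index => true for titles already listed as duplicates

    for ($i = 0; $i < $count; $i++) {
        if (isset($dup[$i])) {
            continue;                // already attached to an earlier title
        }
        $result[] = $titles[$i];
        $wordsA   = explode(' ', $titles[$i]);

        // Only look at later titles, so each pair is compared once.
        for ($j = $i + 1; $j < $count; $j++) {
            if (isset($dup[$j])) {
                continue;            // never attach one title to two groups
            }
            $wordsB = explode(' ', $titles[$j]);
            if (count(array_intersect($wordsA, $wordsB)) >= $minShared) {
                $result[] = array('duplicate' => $titles[$j]);
                $dup[$j]  = true;    // key-based lookup, no in_array() scan
            }
        }
    }
    return $result;
}

$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
);
$titluri = groupSimilarTitles($titles);
print_r($titluri);
```

Using `isset($dup[$j])` keeps the duplicate check a key lookup, which is the trade-off the comment describes: a little extra memory in exchange for not searching an array by value.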
0

I think this code might be what you're looking for (included with comments). If not, let me know - this has been written in a hurry and is untested. Also, you may want to look at an alternative to this - the nested foreach loop is likely to cause performance issues on a big site.

<?php

$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
    );
$titluri    =   array(); // unless it's declared elsewhere
// loop through each line of the array
foreach ($titles as $key => $originalFeed)
{
    $titluri[] = $originalFeed; // all feeds are listed in the new array
    $feed_A = explode(' ', $originalFeed);
    foreach ($titles as $newKey => $comparisonFeed)
    {
        // iterate through the array again and see if they intersect
        if ($key != $newKey) { // but don't compare a line against itself
            $feed_B = explode(' ', $comparisonFeed);
            $intersect_A_B = array_intersect($feed_A, $feed_B);
            // do they share more than three words?
            if(count($intersect_A_B)>3)
            {
                // yes, add a duplicate entry
                $titluri[]['duplicate'] = $comparisonFeed;
            }
        }
    }
}
Alban Pommeret
Dom Weldon
  • Just replaced `$i` by `$newKey` and I think your code is good ! – Alban Pommeret Oct 12 '14 at 20:24
  • not sure if this works, but it's not very efficient: for example, in the first round it compares `Title1` and `Title4`, then later it compares `Title4` and `Title1` again, which gives almost the same result (the same goes for other pairs). Using a for loop (with a counter) should be better. – King King Oct 12 '14 at 20:25
  • You're right, @KingKing - this was written very quickly, please do edit! – Dom Weldon Oct 12 '14 at 20:33
  • using for-loop is of course better in performance but more complex to implement in this case (you can save some calls to `array_intersect`). My comment is there as a note for the OP, he may want to try it himself (it may really require some testing). – King King Oct 12 '14 at 20:36
  • @AlbanPommeret, your code does the job, but it duplicates some entries, as seen in the array (title1 grabs title6, and title2 grabs title4, as they are similar, but then again title6 will also have title1 underneath it, and title4 will have title2 as well, which is a duplicate and I'm trying to avoid that. Just print_r the resulting array to get an idea of what I'm saying, please). – VA13 Oct 12 '14 at 20:40
  • @VladAndrei It wasn't my code, I just edited it in order to fix a comparison which didn't work :) – Alban Pommeret Oct 12 '14 at 20:46
  • ok, @AlbanPommeret, your proposed solution does the job partially, and thanks for that too. However, I am trying to find a solution which does filter and arrange the titles only once. – VA13 Oct 12 '14 at 20:50
  • @KingKing I just added an answer that fixes the issue, I think. – Alban Pommeret Oct 12 '14 at 20:53