Filtering multidimensional arrays by non-equal values to exclude duplicated records

Question

Below are two arrays from two different feeds; the same record has a different id in each. Because of this, I have to rely on 'BriefTitle': I can tell from the 'BriefTitle' and other data (e.g. [LocationCountry], [StartDate], [Condition]) that it is the same record. I would like to take a substr of 'BriefTitle' and compare it to the other 'BriefTitle' records to filter out duplicates, since one title is contained in the other. I am not looking for an exact match, which is what most of the solutions I've found here do.

I like the short solution proposed by sevavietl/mickmackusa in "php remove duplicates from multidimensional array by value":

$result = array_reverse(array_values(array_column(
    array_reverse($data),
    null,
    'BriefTitle'
)));

However, my 'BriefTitle' is an array (which doesn't seem to work with array_column), and I am not sure how to apply the substr function to the solution above.

Some quick notes:

Fortunately, [BriefTitle][0] is always the value to compare.
If possible, I would like to just grab the first instance in the data set, rejecting any following duplicates.

Any thoughts on how I should approach this? The arrays:

Your input data appears to be two separate arrays, not one. Also, there are no similarities in the `BriefTitle` array, so no de-duplicating required. To get an answer to your question, you need to show data which actually has to be changed, and also show what the data should look like afterwards. — Nick, Apr 22 '20 at 23:13
Hi Nick, thanks for the quick reply: 1. Fixed, error on my part. 2. No data changed. I need these de-duplicated based on a partial match (character limits keep me from posting more here). `[0] => Positron Emission Tomography Assessment of Ketamine Binding of the Serotonin Transporter` `[0] => Positron Emission Tomography assessment of Ketamine Binding of the Serotonin Transporter and its Relevance for Rapid Antidepressant Response` — kraxn, Apr 22 '20 at 23:39
(apologies for shoddy formatting, I am figuring out the markdown here) — kraxn, Apr 22 '20 at 23:45
Thanks for the update - so how do you decide what is a match? — Nick, Apr 22 '20 at 23:47
Hi Nick - quick background. This is merged data from 2 different clinical study databases. What I noticed is that some are duplicates: the title gives it away - one title is shorter than the other (country/conditions also help confirm this). The [BriefTitle] is the most distinctive common element, even though it's not an exact match. Hope that makes sense. — kraxn, Apr 22 '20 at 23:59
So is a matching title *always* just a substring of another? If so, which do you want to keep. — Nick, Apr 23 '20 at 00:07
yes, in this case, always. I'd like to keep the first record. — kraxn, Apr 23 '20 at 00:20

score 1 · Accepted Answer · answered Apr 23 '20 at 00:40

Unfortunately, because of the nature of your data (matching strings may be substrings of others, with different case), the only real option is to brute-force this. Loop over the array, storing the title of each record you keep and checking whether the current title contains, or is contained in, any of them:

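$result = array();      // records kept after de-duplication
$brieftitles = array(); // titles of the records kept so far
foreach ($array as $arr) {
    $foundtitle = false;
    $title = $arr['BriefTitle'][0];
    // compare against every title already kept, case-insensitively, in both directions
    foreach ($brieftitles as $btitle) {
        $foundtitle = (stripos($title, $btitle) !== false) || (stripos($btitle, $title) !== false);
        if ($foundtitle) break;
    }
    // no match: this is the first occurrence, so keep it
    if (!$foundtitle) {
        $result[] = $arr;
        $brieftitles[] = $title;
    }
}
print_r($result);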

Demo on 3v4l.org
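
For a quick local check, here is a minimal self-contained sketch of the same loop wrapped in a function. The function name `dedupeByTitleSubstring`, the 'Id' key and the third title are made up for illustration; the first two titles are the ones quoted in the comments on the question. It assumes every record has a non-empty string at ['BriefTitle'][0].

```php
<?php
// Hypothetical helper name; the matching rule is the one from the loop above.
function dedupeByTitleSubstring(array $records): array
{
    $kept = [];        // records retained so far
    $seenTitles = [];  // titles of the retained records

    foreach ($records as $record) {
        $title = $record['BriefTitle'][0];
        $isDuplicate = false;

        foreach ($seenTitles as $seen) {
            // Case-insensitive containment in either direction counts as a duplicate.
            if (stripos($title, $seen) !== false || stripos($seen, $title) !== false) {
                $isDuplicate = true;
                break;
            }
        }

        if (!$isDuplicate) {
            $kept[] = $record;
            $seenTitles[] = $title;
        }
    }

    return $kept;
}

// Sample records: the first two titles come from the comments on the question;
// the 'Id' values and the third title are invented for this example.
$array = [
    ['Id' => 'A1', 'BriefTitle' => ['Positron Emission Tomography Assessment of Ketamine Binding of the Serotonin Transporter']],
    ['Id' => 'B7', 'BriefTitle' => ['Positron Emission Tomography assessment of Ketamine Binding of the Serotonin Transporter and its Relevance for Rapid Antidepressant Response']],
    ['Id' => 'C3', 'BriefTitle' => ['An Unrelated Study Title']],
];

$result = dedupeByTitleSubstring($array);
print_r(array_column($result, 'Id')); // ids of the records that survived
```

With this input the second record should be dropped, because the first title is contained in it once case is ignored. Note that the nested loop compares each record with every title kept so far, so the cost grows roughly quadratically with the number of records.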
