Below are two records from two different feeds, so their ids differ. Because of this, I have to rely on 'BriefTitle': I can tell from the 'BriefTitle' and other data (e.g. [LocationCountry], [StartDate], [Condition]) that this is the same record. I would like to take a substring of 'BriefTitle' and compare it to the other 'BriefTitle' values to filter out duplicates, since one title is contained in the other. I am not looking for an exact match, which is what most of the solutions I've found here provide.

I like the short solution proposed by sevavietl / mickmackusa in "php remove duplicates from multidimensional array by value":

$result = array_reverse(array_values(array_column(
    array_reverse($data),
    null,
    'BriefTitle'
)));

However, my 'BriefTitle' is an array (which doesn't seem to work with array_column), and I am not sure how to apply the substr function to the solution above.

Some quick notes:

  • Fortunately, [BriefTitle][0] is always the value to compare
  • If possible, I would like to just grab the first instance in the data set, rejecting any following duplicates.

Any thoughts on how I should approach this? The arrays:

 [0] => Array
        (
            [Rank] => 422
            [id] => Array
                (
                    [0] => 152091
                )

            [Condition] => Array
                (
                    [0] => Depression
                    [1] => Ketamine
                )

            [BriefTitle] => Array
                (
                    [0] => Positron Emission Tomography Assessment of Ketamine Binding of the Serotonin Transporter
                )

            [LocationCountry] => Array
                (
                    [0] => Austria
                )

            [StartDate] => Array
                (
                    [0] => May 5, 2016
                )

            [LastUpdatePostDate] => Array
                (
                    [0] => October 15, 2018
                )

            [Entheogen] => ketamine
            [Source] => clinicaltrials.gov
        )   


    [1] => Array
        (
            [Rank] => 6673
            [id] => Array
                (
                    [0] => YSBSZ18291
                )

            [Condition] => Array
                (
                    [0] => Depressive Disorder
                    [1] => Ketamine
                )

            [BriefTitle] => Array
                (
                    [0] => Positron Emission Tomography assessment of Ketamine Binding of the Serotonin Transporter and its Relevance for Rapid Antidepressant Response
                    [1] => Die Rolle des Serotonintransporters bei der akuten antidepressiven Wirkung von Ketamin, untersucht mit Positronen-Emissions-Tomographie
                )

            [LocationCountry] => Array
                (
                    [0] => Austria
                )

            [StartDate] => Array
                (
                    [0] => 2016 05 01
                )

            [LastUpdatePostDate] => Array
                (
                    [0] => 2018 10 15
                )

            [Entheogen] => ketamine
            [Source] => clinicaltrialsregister.eu
        )
Jonathan Hall
kraxn
  • Your input data appears to be two separate arrays, not one. Also, there are no similarities in the `BriefTitle` array, so no de-duplicating required. To get an answer to your question, you need to show data which actually has to be changed, and also show what the data should look like afterwards. – Nick Apr 22 '20 at 23:13
  • Hi Nick, thanks for the quick reply: 1. Fixed, error on my part. 2. No data changed. I need these to be de-duplicated based on a partial match (character limits keep me from posting more here). `[0] => Positron Emission Tomography Assessment of Ketamine Binding of the Serotonin Transporter` `[0] => Positron Emission Tomography assessment of Ketamine Binding of the Serotonin Transporter and its Relevance for Rapid Antidepressant Response` – kraxn Apr 22 '20 at 23:39
  • (apologies for shoddy formatting, I am figuring out the markdown here) – kraxn Apr 22 '20 at 23:45
  • Thanks for the update - so how do you decide what is a match? – Nick Apr 22 '20 at 23:47
  • Hi Nick - quick background. This is merged data from 2 different clinical study databases. What I noticed is that some records are duplicates: the title gives it away - one title is shorter than the other (country/conditions also help confirm this). The [BriefTitle] shares the most unique common element, even though it's not an exact match. Hope that makes sense. – kraxn Apr 22 '20 at 23:59
  • So is a matching title *always* just a substring of another? If so, which do you want to keep? – Nick Apr 23 '20 at 00:07
  • yes, in this case, always. I'd like to keep the first record. – kraxn Apr 23 '20 at 00:20

1 Answer

Unfortunately, because of the nature of your data (matching strings may be substrings of others, with different case), the only real option is to brute-force this. Loop over the array, storing the titles you keep as you go and checking whether the current title matches any of them:

$result = array();
$brieftitles = array();   // titles of the records kept so far
foreach ($array as $arr) {
    $foundtitle = false;
    $title = $arr['BriefTitle'][0];
    foreach ($brieftitles as $btitle) {
        // case-insensitive check: is either title contained in the other?
        $foundtitle = (stripos($title, $btitle) !== false) || (stripos($btitle, $title) !== false);
        if ($foundtitle) break;
    }
    // no match against any kept title, so this is the first instance: keep it
    if (!$foundtitle) {
        $result[] = $arr;
        $brieftitles[] = $title;
    }
}
print_r($result);

Demo on 3v4l.org
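
For reuse, the same loop can be wrapped in a function. The following is only a sketch: the function name `dedupeByTitleSubstring` is made up for illustration, the sample records are cut-down versions of the two from the question (only the first `BriefTitle` entry is kept), and it assumes, as stated in the question, that `BriefTitle[0]` is always the value to compare.

```php
<?php
// Hypothetical helper: keeps the first record of each group whose
// BriefTitle[0] values contain one another (case-insensitively).
function dedupeByTitleSubstring(array $records): array
{
    $kept = [];
    $seenTitles = [];
    foreach ($records as $record) {
        // Assumption from the question: BriefTitle[0] is the value to compare.
        $title = $record['BriefTitle'][0];
        $isDuplicate = false;
        foreach ($seenTitles as $seen) {
            // Containment check in both directions, ignoring case.
            if (stripos($title, $seen) !== false || stripos($seen, $title) !== false) {
                $isDuplicate = true;
                break;
            }
        }
        if (!$isDuplicate) {
            $kept[] = $record;
            $seenTitles[] = $title;
        }
    }
    return $kept;
}

// Cut-down versions of the two records shown in the question.
$data = [
    [
        'id' => ['152091'],
        'BriefTitle' => ['Positron Emission Tomography Assessment of Ketamine Binding of the Serotonin Transporter'],
        'Source' => 'clinicaltrials.gov',
    ],
    [
        'id' => ['YSBSZ18291'],
        'BriefTitle' => ['Positron Emission Tomography assessment of Ketamine Binding of the Serotonin Transporter and its Relevance for Rapid Antidepressant Response'],
        'Source' => 'clinicaltrialsregister.eu',
    ],
];

$result = dedupeByTitleSubstring($data);
echo count($result), "\n";        // prints 1
echo $result[0]['Source'], "\n";  // prints clinicaltrials.gov
```

Each title is compared against every title kept so far, so the work grows roughly quadratically with the number of distinct titles. Empty titles would need separate handling, since `stripos` treats an empty needle differently in PHP 7 (warning, returns false) and PHP 8 (returns 0, which would count as a match).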

Nick