
I can't seem to figure out what magic is happening behind the PHP scene and why array_unique cannot detect my duplicates.

In my specific situation, I have two collections of users that I merge into one, keeping only unique entries. To do that I convert both collections to arrays, array_merge() them, and then, depending on a parameter, apply array_unique(..., SORT_REGULAR) so that the items are compared as objects without any conversion. I realise that comparing objects is a slippery slope, but in this case it's weirder than I thought.

After the merge but before the uniqueness check I have this state: [screenshot]

As you can see, items 4 and 11 are the same User entity (both non-strict and strict comparison agree on that). Yet after array_unique() they both remain in the list for some reason: [screenshot]

As you can see, items 7-10 were detected and removed, but 11 wasn't.

How is that possible? What am I not seeing here?

Currently running PHP 7.4.5

Code is from project using Symfony 4.4.7 and Doctrine ORM 2.7.2 (although I think this should be irrelevant, if the objects are equal both by == and === comparisons).

Fun fact for bonus points: applying array_unique twice in a row actually gives unique results: [screenshot]

Mind = blown

UPDATE: I have added throw new \RuntimeException() to my User::__toString() method, to be extra sure no one is doing a conversion to string.

Please do not suggest converting to string - that is neither a solution to my problem, nor what this question is about.

mkilmanas
  • Does https://stackoverflow.com/questions/2426557/array-unique-for-objects answer your question? – CBroe May 20 '20 at 10:36
  • @CBroe though OP uses `SORT_REGULAR` option and there's no cast to string, still `array_unique` is not supposed for this. – u_mulder May 20 '20 at 10:40
  • @CBroe no, it does not. As @u_mulder pointed out, I am already using SORT_REGULAR and it still gives me inexplicable results. I have even added `throw new \RuntimeException()` into `User::__toString()` to make sure no shenanigans are happening there. – mkilmanas May 20 '20 at 10:51
  • @u_mulder - can you elaborate on "still array_unique is not supposed for this"? Why not? – mkilmanas May 20 '20 at 10:51
  • array_unique compares strings. So make a check comparing the objects' string representations: (string) $a === (string) $b – Vitali Protosovitski May 20 '20 at 10:51
  • So satisfyingly odd :) This might be a good read to start understanding it: https://github.com/php/php-src/blob/50a9f511ccc8946551f8dcb573476e075dce330c/ext/standard/array.c#L4419 – β.εηοιτ.βε May 26 '20 at 19:29
  • Looks like the code creates the return value array (converting to strings where needed - see the upper conversion) and removes elements by index using a second array for comparison purposes (`arTmp` in the code). This second array uses *pointers* to the variables (see `cmpdata->b.val` where b is a pointer and so b.val is not the string representation) to find what to remove. This works as everything is removed by index. As for the second time you call the function, it works because this time you ARE passing in strings as this is what the first function returned. – Robbie May 27 '20 at 07:31
  • Interesting, @Robbie, indeed. Now what I wonder is: 1. I tried to reproduce the OP's issue with a simple made-up example, but can't, so I wonder what makes the OP's case unique here. 2. I also wonder what makes the pair (4,11) different from the other pairs removed in the first pass of `array_unique`. So really I am missing a reproducible example here :/ – β.εηοιτ.βε May 28 '20 at 12:23
  • Having also tried this (PHP 7.4.3, a couple of minor versions back from the OP), I also get the expected behaviour, not what the OP is reporting. There could be something funky about the Collection->toArray() in the OP's code, but I also tried SplFixedArray->toArray() and \Ds\Collection->toArray() and still got the expected result. I do see that the array comparison code changed recently (e.g. https://github.com/php/php-src/commit/33ef3d64dac366733f2af40d5bce2bac4e5bca1e#diff-497f073aa1ab88afcb8b248fc25d2a12) - OP could have found a bug? – Robbie May 29 '20 at 01:33
  • On my side I tested on 7.4.6 (latest on the docker image) and 7.4.5 to be sure the OP did not hit a bug in that specific version – β.εηοιτ.βε May 29 '20 at 07:20
  • I like where this discussion is going. I've tried looking at the php-src but I'm neither a C programmer (only did a tiny bit of C++ a long while ago) nor ever looked into php internals, so making sense of what is happening there is quite tough. @Robbie - when saying it's doing string conversions where needed - do you see that happening for SORT_REGULAR? I couldn't find anything like that. Also, if the result is the same array with some keys unset (or is that understanding wrong?) - doesn't that mean I am passing the same array for the second call? – mkilmanas May 29 '20 at 12:19
  • P.S. For trying out on different versions, I really recommend https://3v4l.org/ - it runs lots of different versions in parallel and tells you if the results are different – mkilmanas May 29 '20 at 12:22
  • [OFF]: What tool to building you are using? – Felipe Gustavo May 29 '20 at 13:26
  • @mkilmanas if the data is not too sensitive, would you maybe mind doing a `dump()` of the Users in 4 and 11 and reproducing those in your question? – β.εηοιτ.βε May 29 '20 at 16:05
  • does $a->toArray() return values or references? – ts. Jun 01 '20 at 07:30
  • @β.εηοιτ.βε - due to loads of relations I cannot even var_export the User object - the best I could give is an anonymized screenshot of what that particular object looks like --> https://imgur.com/a/bHegq7C @ts. - Not sure if I get your question - `toArray()` returns an array, containing all these User objects. Since they are objects, technically the answer is 'references', but in PHP (just like in Java, I suppose) there is not much difference between "value" and "reference" when talking about objects (or "values" as such don't exist at all - depending on your interpretation) – mkilmanas Jun 01 '20 at 10:50
  • What I have managed to figure out in terms of reproducing: if I take all the same user entities, load them from DB one by one by their id, and put them into the collection in exactly the same order - array_unique works as expected. But if I replace one User object with its proxy object (like #6 in my original screenshot), then suddenly it throws the sorting/comparison off and I get the same result - object ID=7 (or item #4/11) twice despite it being the SAME object. – mkilmanas Jun 01 '20 at 11:10
  • When I added one extra (the same) user to both collections, I got item #11 to be kept despite being the same as item #3 (which is User ID=6, i.e. a different object than before). When I added two extra users, I got result with #16===#7. Yet, when I swapped the order of the same users (moved the last user 2 positions up), then result is with no duplicates. So I don't even know anymore what to make of this - it's somehow related to Proxy objects, it depends on the sequence order of the input rather than particular object data, but cannot see any clear pattern so far – mkilmanas Jun 01 '20 at 11:24
  • _due to loads of relations I cannot even var_export the User object_: this is why you should use Symfony's [`dump()`](https://symfony.com/doc/current/components/var_dumper.html#the-dump-function) because it is built to be protected against reference loops – β.εηοιτ.βε Jun 01 '20 at 12:03
  • @ts `toArray()` returns the private array representing the elements of the ArrayCollection. See https://github.com/doctrine/collections/blob/a4504c79efd8847cc77d10b70209ef838b10338f/lib/Doctrine/Common/Collections/ArrayCollection.php#L69 – β.εηοιτ.βε Jun 01 '20 at 12:07
  • @mkilmanas I really think the Proxy is your issue. The underlying C code of PHP seems to sort the elements, then loop through them and remove an element when it is equal to the previous element that was looped on. By introducing a different object in there you might break this mechanism. – β.εηοιτ.βε Jun 01 '20 at 12:11
  • Now, this said, there are plenty of good ways to do this in a "more Symfony" way. Would you be interested in a different approach or are you just trying to understand the oddities of PHP? :) – β.εηοιτ.βε Jun 01 '20 at 12:13
  • Rather than trying to fight against a wonky builtin function that will _never_ be able to do what you're expecting, why not instead create a new Collection with a unique constraint and add the member items to _that_. – Sammitch Jun 01 '20 at 20:52
  • @β.εηοιτ.βε - most of all, I would love to understand why PHP is behaving this way, and see if this potentially is a bug. Failing that, a "more Symfony" solution would also be interesting, as the only workaround so far seems to be applying the function twice. But without understanding why the first one is sometimes failing, I cannot be sure that double application will be fail-proof. – mkilmanas Jun 01 '20 at 21:42
  • @β.εηοιτ.βε You are right about sorting and proxy -- here I've tried to do `sort($items, SORT_REGULAR)` and see what it sees at the intermediate step https://imgur.com/a/MrXqSkM . Sure enough, after "sorting" the proxy object got positioned between the two identical objects. So if the removal happens by checking consecutive elements, it's pretty clear how that fails. Note the strange comparisons between items 2/3/4 (after sorting) - 2 < 3 < 4, but 2 === 4. I have a feeling that the "order" of objects from different classes is not well defined. – mkilmanas Jun 01 '20 at 21:49
  • There's actually a pretty explicit warning in the [documentation](https://www.php.net/manual/en/function.sort.php): _"Be careful when sorting arrays with mixed types values because sort() can produce unexpected results, if sort_flags is SORT_REGULAR"_ – Marvin Jun 01 '20 at 23:19
  • @Marvin thanks, I had not seen that warning myself (mostly because I was digging into array_unique and did not realize until yesterday that sort plays a major role there). I'd say this comment of yours answers about 50% of the whole question at hand – mkilmanas Jun 02 '20 at 09:39

3 Answers

2

For the issue at hand, I strongly suspect this comes from the way array_unique removes elements from the array when the SORT_REGULAR flag is used:

  1. sorting it
  2. removing adjacent items if they are equal

Because you have a Proxy object in the middle of your User collection, the sort can place it between two identical User objects. They are then no longer adjacent, so the duplicate is not removed, which might be the issue you are currently facing.

This seems to be backed up by the warning on the sort page of the PHP documentation, as pointed out in Marvin's comment.

Warning Be careful when sorting arrays with mixed types values because sort() can produce unexpected results, if sort_flags is SORT_REGULAR.

Source: https://www.php.net/manual/en/function.sort.php#refsect1-function.sort-notes
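
To make the failure mode concrete, here is a minimal sketch of the second step only (dropping adjacent equal items), applied to an order like the one the OP saw after `sort($items, SORT_REGULAR)`. The `User` and `UserProxy` classes and the `dropAdjacentDuplicates()` helper are made up for illustration. This is not the php-src implementation, and the input order is written by hand rather than produced by PHP's sort.

```php
<?php

// Hypothetical stand-ins for the Doctrine entity and its generated proxy.
class User
{
    public $id;

    public function __construct(int $id)
    {
        $this->id = $id;
    }
}

class UserProxy extends User
{
}

/**
 * Simplified model of "remove adjacent items if they are equal".
 * Assumes $sorted is already in whatever order the sort step produced.
 */
function dropAdjacentDuplicates(array $sorted): array
{
    $result = [];
    $previous = null;

    foreach ($sorted as $key => $item) {
        // Loose comparison: objects of different classes are never equal.
        if ($result !== [] && $previous == $item) {
            continue;
        }
        $result[$key] = $item;
        $previous = $item;
    }

    return $result;
}

$user7 = new User(7);
$proxy = new UserProxy(6);

// Identical objects next to each other: the duplicate is dropped.
$adjacent = dropAdjacentDuplicates([$user7, $user7, $proxy]);

// A proxy wedged between them: no neighbouring pair is equal, so all stay.
$separated = dropAdjacentDuplicates([$user7, $proxy, $user7]);

echo count($adjacent), "\n";  // 2
echo count($separated), "\n"; // 3
```

The second call keeps the same `User` instance twice, which matches the symptom in the question: the result depends on where the sort happens to put the object of the other class.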


Now for a possible solution, here is something more Symfony flavoured.

It uses the ArrayCollection filter and contains methods to filter the second collection and only add the elements that are not already present in the first collection.
To be complete, this solution also relies on the use language construct to pass the first collection ($a) into the closure required by filter.

The result is a new ArrayCollection in which no user from the second collection duplicates one from the first.

// Assumes: use Doctrine\Common\Collections\ArrayCollection;
//          use Doctrine\Common\Collections\Collection;
public static function merge(Collection $a, Collection $b, bool $unique = false): Collection
{
    if ($unique) {
        // Keep only the items of $b that are not already in $a.
        $b = $b->filter(function ($item) use ($a) {
            return !$a->contains($item);
        });
    }

    return new ArrayCollection(array_merge($a->toArray(), $b->toArray()));
}
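
If "duplicate" means "the very same instance", as in the question, another option is to skip sorting and comparison altogether and track object identity. The following is a sketch under that assumption. The helper name `uniqueByIdentity()` is hypothetical and not part of Doctrine or Symfony. It relies on `spl_object_id()` (PHP 7.2+), whose ids are only guaranteed to be distinct while the objects are alive, which holds here because the input array keeps them referenced.

```php
<?php

/**
 * Keeps the first occurrence of every object instance.
 * Identity is decided by spl_object_id(), so no sorting and no
 * property comparison is involved.
 */
function uniqueByIdentity(array $objects): array
{
    $seen = [];
    $result = [];

    foreach ($objects as $object) {
        $id = spl_object_id($object);
        if (isset($seen[$id])) {
            continue; // this exact instance was already added
        }
        $seen[$id] = true;
        $result[] = $object;
    }

    return $result;
}

$a = new stdClass();
$b = new stdClass();
$c = new ArrayObject(); // an object of a different class in between

$unique = uniqueByIdentity([$a, $b, $c, $a, $b]);

echo count($unique), "\n"; // 3
```

Unlike the filter-based merge, this also removes repeats that occur within a single input array, and the order of first occurrences is preserved.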
β.εηοιτ.βε
  • Thanks, I'll take this as an accepted answer as it is the closest to what I'm after. This and @Marvin's comment about a warning when SORT_REGULAR sorting array of objects of different types. When digging through the PHP source code and internal docs, I got as far as object zval _zend_object_value and how it consists of handle and handler table, how each class can have different handler table, yet some classes can share some handlers, and I suppose for sort/array_unique to work, they need to share the 'compare' handler. I just couldn't find more info about where/what/which those handlers are. – mkilmanas Jun 02 '20 at 09:38
  • I followed that question since the beginning but also only realized the sorting issues after @β.εηοιτ.βε's comments. So... well deserved bounty. – Marvin Jun 02 '20 at 12:46
0

I know you said you don't want to convert to a string, but since you don't seem to have a way out yet, I suggest applying serialize() to each object in your array. I haven't found a way to compare objects that doesn't involve converting them to an array or a string (you could try converting to binary or hex instead, but I don't know whether that is possible without going through a string first).

With serialize() you turn each object into PHP's own storable representation, which you can compare with other serialized objects. This is safe because you can call unserialize() afterwards and get an equivalent object back.

So you can serialize all elements of the array and then use array_unique, like this:

<?php

header("Content-Type: application/json");

class MyClass
{
    public $var1;
    public $var2;
    function __construct($var1, $var2)
    {
        $this->var1 = $var1;
        $this->var2 = $var2;
    }

}

$arr = [
    "a",
    "a",
    [1,2,3],
    "b",
    [1,2,3],
    new MyClass(1,1),
    new MyClass(1,new MyClass(1,1)),
    new MyClass(1,new MyClass(1,1)),
];

// Compare the serialized strings, then turn the survivors back into values.
$arrSerialized = array_map("serialize", $arr);

var_dump(
    array_map(
        "unserialize",
        array_unique(
            $arrSerialized,
            SORT_STRING
        )
    )
);

/* output:
array(5) {
    [0]=>
    string(1) "a"
    [2]=>
    array(3) {
        [0]=>
        int(1)
        [1]=>
        int(2)
        [2]=>
        int(3)
    }
    [3]=>
    string(1) "b"
    [5]=>
    object(MyClass)#6 (2) {
        ["var1"]=>
        int(1)
        ["var2"]=>
        int(1)
    }
    [6]=>
    object(MyClass)#7 (2) {
        ["var1"]=>
        int(1)
        ["var2"]=>
        object(MyClass)#8 (2) {
            ["var1"]=>
            int(1)
            ["var2"]=>
            int(1)
        }
    }
}
*/

Hope this helps, have a good day!

P.S.: serialize also preserves the type of a value: 1 and "1" serialize to different strings, so they are not treated as duplicates of each other.
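
The type-preservation point from the P.S. can be checked with a short sketch like the following (the variable names are arbitrary):

```php
<?php

$int = serialize(1);      // i:1;
$string = serialize("1"); // s:1:"1";

// The serialized forms differ because the type is part of the output.
var_dump($int === $string); // bool(false)

// With the default string comparison, 1 and "1" collapse into one entry.
$plain = array_unique([1, "1"]);

// Comparing the serialized forms keeps both.
$typed = array_unique(array_map("serialize", [1, "1"]), SORT_STRING);

echo count($plain), "\n"; // 1
echo count($typed), "\n"; // 2
```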

  • I don't think this will work in the OP's context. A Symfony User is something specific that already has a serialize function in order to put the connected User in the session. If the OP has to include all fields of the User in the serialization, and so in the session, that wouldn't really be ideal – β.εηοιτ.βε May 29 '20 at 16:02
0

Without knowing more about your entity class it's hard to guess why this is happening, but I suspect your main issue is the __toString() method. If you have not defined it, you should add one that returns a unique string for each entity object. If it is already defined, make sure it returns a distinct string for each object.

class User{ 
   private $name;

   function __construct($name){ 
      $this->name=$name;
   }

   function __toString(){ 
     return $this->name; 
   }
}

$users = [];
$users[] = new User("User1");
$users[] = new User("User2");
$users[] = new User("User3");

$user1= $users[0];
$users[]=$user1; //duplicate

echo(count(array_unique($users))); // output should be 3

Given the limited information about the entity class, this is as far as I can guess.

Edit:

After reading your edits, I guess you are locking yourself into this, since array_unique will try to convert an entity object to either a string or a number depending on the sort flag you pass (see the array_unique documentation). So you either need to implement __toString() or add to the entity some public properties that define the uniqueness of your object, e.g.:

class User{ 
       public $id;
       private $name;

       function __construct($id,$name){
          $this->id=$id;
          $this->name=$name;
       }
}

$users = [];
$users[] = new User(1,"User1");
$users[] = new User(2,"User2");
$users[] = new User(3,"User3");

$user1= $users[0];
$users[]=$user1; //duplicate
echo(count(array_unique($users, SORT_REGULAR))); // output should be 3

Please note the public property $id and SORT_REGULAR flag.
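
For objects of a single class, the comparison behind SORT_REGULAR looks at property values rather than identity. A minimal sketch of what that implies, using a hypothetical `Point` class that is not from the question:

```php
<?php

// Hypothetical class, only used to illustrate the comparison rules.
class Point
{
    public $x;

    public function __construct(int $x)
    {
        $this->x = $x;
    }
}

$first = new Point(1);
$copy = new Point(1);  // a different instance with the same property values
$other = new Point(2);

var_dump($first == $copy);  // bool(true):  same class, same property values
var_dump($first === $copy); // bool(false): not the same instance

// With SORT_REGULAR the two equal-valued instances count as duplicates.
$unique = array_unique([$first, $copy, $other], SORT_REGULAR);

echo count($unique), "\n"; // 2
```

So with SORT_REGULAR, "unique" means "no two items that compare equal with ==", which is broader than "no instance twice" and only behaves predictably while all items are of the same class.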

sakhunzai
  • Your edit actually proves the OP's point: when two items are the same, array_unique with `SORT_REGULAR` should work, but for some reason it does not – β.εηοιτ.βε Jun 01 '20 at 18:32
  • @β.εηοιτ.βε maybe we need more info about `User` and its public properties. – sakhunzai Jun 02 '20 at 02:39
  • @sakhunzai public properties do not seem to play a major role here, as two entries in the array are THE SAME object (unless you know something that I don't about how `sort($items, SORT_REGULAR)` works internally, especially with an array of objects from 2 different classes) – mkilmanas Jun 02 '20 at 09:46