How to check whether one array is a correctly ordered subset of another

Question

I have a few arrays of strings like:

$big = ['html', 'body', 'div', 'table', 'tbody', 'tr', 'td'];
$small = ['body', 'div', 'td'];
$wrong = ['td', 'body', 'div'];

I need to check whether $small and $wrong are to be found in $big. However I need the order to be the same. So my function should return true for $small and false for $wrong. It should be fairly trivial to do it manually but I need the code to be fast. So ideally if there is a built in that achieves this I would rather use that.

So the question is mainly whether such a built in exists. Here is the code I have come up with in case it doesn't:

/**
 * Returns whether the substack is contained in the stack in the correct order.  
 * 
 * @param string[] $stack           The substack to check
 * @param string[] $subStack        The substack to check 
 * @return bool
 */
function stackInStack(array $stack, array $subStack)
{

    // First let's do a simple array diff to save time on an ordered diff;
    // TODO: Check if this actually improves average performance.
    if (count(array_diff($subStack, $stack)) !== 0) return false;

    $stackSize = count($stack);
    $subStackSize = count($subStack);

    $stackIndex = 0;
    for ($subIndex = 0; $subIndex < $subStackSize; $subIndex++) {

        while (
            $stackIndex < $stackSize &&
            $stack[$stackIndex] !== $subStack[$subIndex]
        ) {
            $stackIndex++;
        }

        if ($stackIndex == $stackSize) {
            if ($subIndex <= $subStackSize - 1) {
                return false;
            } elseif ($subIndex > $subStackSize - 1) {
                throw new Exception('Very Strange Exception: subIndex has outgrown subStacksize');
            }
        } elseif ($stackIndex > $stackSize) {
            throw new Exception('Very Strange Exception: index has outgrown stacksize');
            break;
        } 

    }
    return true;
}

Provided a built in doesn't exist or is slow, any tips to improve the efficiency of the above code (other than to rewrite it in c) would also be greatly appreciated.

Hello? Did you give up??? – AbraCadaver Dec 07 '19 at 00:27 — AbraCadaver, Dec 07 '19 at 00:27

Jeremy Harris · Answer 1 · 2019-11-15T13:03:02.687

2

Assuming your arrays are not too big, you could use string comparisons instead. Something like this:

<?php

    $big = ['html', 'body', 'div', 'table', 'tbody', 'tr', 'td'];
    //$small = ['body', 'div', 'td']; // This is the original
    $small = ['body', 'div', 'table']; // This is for testing
    $wrong = ['td', 'body', 'div'];

    $bigToken = implode($big, ''); // Output: htmlbodydivtabletbodytrtd
    $smallToken = implode($small, ''); // Output: bodydivtable
    $wrongToken = implode($wrong, ''); // Output: tdbodydiv

    if (stristr($bigToken, $smallToken) !== false) {
        echo("Small is in big!");
    }
    elseif (stristr($bigToken, $wrongToken) !== false) {
        echo("Wrong is in big!");   
    }
    else {
        echo("No match found :)");
    }

?>

It basically converts the array to a string and checks if another string is contained inside. Performance wise, that will all depend on how large your actual arrays are, but this ensures the proper order and is easier to maintain.

As pointed out in the comments below, it would be a good idea to implode with some sort of token to ensure you are correctly separating the tags in instances where there may be a conflict.

edited Nov 15 '19 at 13:03

answered Nov 14 '19 at 15:08

Jeremy Harris

24,318
13
79
133

I think substr_count() is faily faster than stristr() – Oliver M Grech Nov 14 '19 at 15:10
nice approach btw :) – Oliver M Grech Nov 14 '19 at 15:10
Thank you, yeah that is probably true. I have never benchmarked it. – Jeremy Harris Nov 14 '19 at 15:13
1

For those who want to optimize this https://stackoverflow.com/questions/20269745/php-what-is-the-fastest-way-to-count-strings-occurrences – Oliver M Grech Nov 14 '19 at 15:18
Maybe a separator would be a good idea, as currently there is nothing to check if something like `tbody` and `t`,`body` would currently be the same string. – Nigel Ren Nov 14 '19 at 15:41
I was ready to dismiss this altogether until I noticed that all the other answers were providing false positives for my use case. It isn't really what I asked for. But it turns out what I asked for wasn't what I needed. It benchmarks better than all the other answers too. 16% better than the best alternative which is a combination of mine and @vivek_23's. Probably because it achieves an easier task. The differences between strpos, strstr and substr_count are more or less negligable, though strpos is ever so slightly faster. – kaan_a Nov 15 '19 at 13:51
It's good at least as a stop-gap solution. Eventually I might replace it with something more sophisticated. But in order to avoid false positives I will need better qualified inputs (for example ['div>', 'table'] vs ['div', 'table'] a la CSS or ['div', 'table'] vs ['div', '/table'] a la xpath) and this will require a significantly more complicated function than anything anyone has written here. Anyways I'm torn on whose answer to accept now. @vivek_23 answered the actual question I asked but I ended up using your code. – kaan_a Nov 15 '19 at 13:53
I am glad it helped. You won't offend me if you don't choose my answer :) – Jeremy Harris Nov 15 '19 at 13:57

Nigel Ren · Answer 2 · 2019-11-14T15:15:27.803

2

This is a bit shorter than your version, it uses array_intersect() to work out the common elements in the two arrays and then compares the result with the sub stack to see if they are the same...

$big = ['html', 'body', 'div', 'table', 'tbody', 'tr', 'td'];
$small = ['body', 'div', 'td'];
$wrong = ['td', 'body', 'div'];

function stackInStack(array $stack, array $subStack)
{
    return array_values(array_intersect($stack, $subStack)) == $subStack;
}

var_dump(stackInStack($big, $small));
var_dump(stackInStack($big, $wrong));

Just to show what I mean

print_r(array_intersect($big, $wrong));

gives...

Array
(
    [1] => body
    [2] => div
    [6] => td
)

so compare this against $wrong and it isn't in the same order.

edited Nov 14 '19 at 15:15

answered Nov 14 '19 at 15:09

Nigel Ren

56,122
11
43
55

1

What if `$big` had duplicates? `$big = ['html', 'body', 'div', 'body','table', 'tbody', 'tr', 'td'];` – nice_dev Nov 14 '19 at 15:32
@vivek_23, how would you then know which one you want to match with the sub array? `$small = ['body', 'div'];` becomes ambiguous. – Nigel Ren Nov 14 '19 at 15:33
I assume OP just wants the sequence to be same as `$small`, meaning test if `$small` is a subsequence of `$big`. It's ok t match the first `body` than the second `body`. – nice_dev Nov 14 '19 at 15:36
@vivek_23, surely then the second `body` is irrelevant as the first one would already indicate the sequence is correct - not sure how the second one changes the validity of any of the examples. – Nigel Ren Nov 14 '19 at 15:38
I mean array_intersect in this context would return false. https://3v4l.org/2qjdt – nice_dev Nov 14 '19 at 15:41
@vivek_23, but the sequence of tokens is different - by the same logic I would expect `$small = ['body', 'div', 'body', 'td'];` – Nigel Ren Nov 14 '19 at 15:43
For subsequence matching, it's ok to match the first `body` to make the sequential matching in order. I think OP needs to clarify this soon to avoid the confusion. – nice_dev Nov 14 '19 at 15:45
If there are 2 different `body` in `$small`, then surely they are 2 different tokens. I had 2 different tokens in `$big` in my first comment. – nice_dev Nov 14 '19 at 15:47
1

This is for something very much like a css or xpath selector, so @vivek_23 is correct. The stack is all the xml nodes up to and including the current one being processed by XMLReader and the substack is a parameter that determines whether the current node will be stored. I didn't think the context would be very relevant and thought it would be better to generalize the question as much as possible. The array_intesect method is very clever, perhaps it will be of some use to someone else, but for my use case it outputs false negatives. I'll probably benchmark tommorrow anyways, out of curiosity. – kaan_a Nov 14 '19 at 19:33
Ok so after benchmarking I have to say, your answer is not so great. I tested 90 typical scenarios and it performs about 40% slower on average than @vivek_23's solution (which performs identical to mine when you remove the early checks). So I would definitely avoid this regardless of the use case, unless performance isn't an issue and you are close to a deadline – kaan_a Nov 15 '19 at 06:57
1

One thing I would point out is that if you are reading HTML, XMLReader may not be able to cope with a lot of the stuff you see out in the real world. Anything slightly broken may cause issues as XHTML is a lot stricter than HTML. There are also use cases where simple matching a hierarchy doesn't work (I did something similar for a JSON streaming parser). – Nigel Ren Nov 15 '19 at 07:01
Sorry, meant that on the question :-( – AbraCadaver Nov 18 '19 at 14:35

score 1 · Answer 3 · answered Nov 14 '19 at 15:17

1

This is pretty much a duplicate of Nigel Ren's, but I had typed it up in another window.

Compute the intersection which will be in the order of the first array, re-index both and compare the intersection with the original:

$result = array_values(array_intersect($big, $small)) === array_values($small);
// true

$result = array_values(array_intersect($big, $wrong)) === array_values($wrong);
// false

answered Nov 14 '19 at 15:17

AbraCadaver

78,200
7
66
87

1

What if `$big` had duplicates? `$big = ['html', 'body', 'div', 'body','table', 'tbody', 'tr', 'td'];` – nice_dev Nov 14 '19 at 15:32

score 1 · Accepted Answer · answered Nov 14 '19 at 15:44

1

Just use simple pointer to iterate over $needle which is the smaller array to be found as a subsequence in a bigger array. Once match is found, increment the pointer, else keep moving. If the pointer reaches the length of the smaller array, bingo, it's a subsequence, else it isn't.

Snippet:

<?php

$big = ['html', 'body', 'div', 'body','table', 'tbody', 'tr', 'td'];
$small = ['body', 'div', 'td'];
$wrong = ['td', 'body', 'div'];


function isSubsequence($haystack,$needle){
    if(count($needle) > count($haystack)) return false;
    $keys = array_keys($needle);
    $ptr = 0;
    $len = count($needle);

    foreach($haystack as $element){
        if($ptr === $len) return true;
        if($needle[$keys[$ptr]] === $element) $ptr++;
    }

    return $ptr === $len;
}


var_dump(isSubsequence($big, $small));
var_dump(isSubsequence($big, $wrong));

Demo: https://3v4l.org/vqBMj

answered Nov 14 '19 at 15:44

nice_dev

17,053
2
21
35

Your solution has less lines and is easier to read and understand than mine. I believe it also takes care of associative arrays, which mine will probably fail for (haven't tested either). I'm going to accept your answer. But you should modify it to take into consideration my findings regarding how expensive count() is in the next comment. In the end I will probably go for a variant of your solution that uses a sequence instead of an array, provided it benchmarks better. – kaan_a Nov 15 '19 at 07:29
So I did some benchmarks. Initially your solution tested 7% slower. When I removed all the early conditions they tested identically. When I added my initial conditional into your function, yours performed between 2-3% faster than mine. It turns out that count is very expensive and array_diff is not as expensive as one might think... – kaan_a Nov 15 '19 at 07:31
Initially I thought array diff was actually more expensive than count but the the situation where haystack is smaller than needle is atypical (which is why I didn't even test for it in the beginning) and that's why yours is running slower. But when I added 10% such situations to my testing data your function started performing 20% worse than mine. Which I think proves array_diff is actually faster than somehow – kaan_a Nov 15 '19 at 07:32
@Kaan Interesting metrics. What happens if you remove this early check in my code `if(count($needle) > count($haystack)) return false;`? Also, `count()` in PHP happens in O(1) time. See here https://stackoverflow.com/questions/5835241/is-phps-count-function-o1-or-on-for-arrays/5835419. – nice_dev Nov 15 '19 at 07:46
@Kaan Also, arrays are pass by value in PHP. So, the function parameters are a new copy of the passed arrays. Does changing the function definition from `function isSubsequence($haystack,$needle){` to `function isSubsequence(&$haystack,&$needle){` improve performance? Because here we are instructing it to use the same array rather than making a new copy. – nice_dev Nov 15 '19 at 07:48
Removing the conditional improves performance by about 2%. Oddly enough passing by reference makes it slightly slower (about 1%). I found this: https://stackoverflow.com/a/9740541/1788771. So in your function the arrays are passed by reference whether you prefix the params or not and the & is just adding complexity to the compile step. Please note that I'm testing all this with the CLI. So there would be no difference at all if this was running as a module and the vm code was cached. – kaan_a Nov 15 '19 at 08:02
I would like to reiterate that in a very typical case where a selector is being checked against a completely different part of the document than intended array_diff makes a huge difference, speeding up the execution by 10% on avarage (so including cases where it matches and it has to check sentimentality). So you should add that as an optional section in your answer – kaan_a Nov 15 '19 at 08:03
@Kaan Ohh, I wasn't aware about the [copy-on-write strategy](https://www.php.net/manual/en/internals2.variables.intro.php) in PHP as you pointed out. Nice to learn something new. Can you share the test samples and the metrics via some link? Also, do you want me to add `array_diff` as an option? Also that [array_diff](https://github.com/php/php-src/blob/03b78335583033d5d347dc9b667eccabed94ca24/ext/standard/array.c#L5461) loops through elements anyway. Not sure how it makes it any faster. – nice_dev Nov 15 '19 at 09:37

Federico · Answer 5 · 2019-11-15T13:52:20.810

this should work also on associative arrays:

"array_search()" will return the key corresponding to the value, so u can see if the sequence is right...

function is_array_in_array($array, $inarray){
    $pos = -1; $match = 0;
    $array = array_values($array);
    $inarray = array_values($inarray);
    foreach($array as $v){
        $p = array_search($v, $inarray);
        if($p===false || $p<=$pos) return false;
        $pos = $p; $match++;
    }
    if(count($array)==$match) return true;
    return false;
}

And it will return:

$small_assoc = array('A'=>'body', 'B'=>'div', 'C'=>'td');

is_array_in_array($small, $big);  # =1
is_array_in_array($wrong, $big);  # =0
is_array_in_array($big, $big);    # =1
is_array_in_array($small_assoc, $big); # =1

How to check whether one array is a correctly ordered subset of another

5 Answers5