Finding duplicate 2D arrays in Matlab 3D array

Question

worldStates is a Matlab MxNxL 3D array (tensor) containing L states of a MxN grid of binary values.

ps is a length L list of probabilities associated with the different states.

The function [worldStates, ps] = StateMerge(worldStates, ps) should remove duplicate world states and sum the probabilities of the merged states to the single state that remains. Duplicate states are states with the exact same configuration of binary values.

Here is the current implementation of this function:

function [worldStates, ps] = StateMerge(worldStates, ps)
    
    M = containers.Map;
    
    for i = 1:length(ps)
        s = worldStates(:,:,i); 
        s = mat2str(s);
        if isKey(M, s)
            M(s) = M(s) + ps(i);
        else
            M(s) = ps(i);
        end
    end
    
    stringStates = keys(M);
    n = length(stringStates);
    
    sz = size(worldStates);
    worldStates = zeros([sz(1:2), n]);
    ps = zeros(1, 1, n);
    
    for i = 1:n
        worldStates(:,:,i) = eval(stringStates{i});
        ps(i) = M(stringStates{i});
    end
end

It uses a Map to be able to remove duplicates in O(L) time, using the states as keys and the probabilities as values. Since Matlab maps does not allow for general data structures as keys the states are converted into string representations to be used as keys and later converted back to arrays using the eval function.

It turns out this code is way to slow for my needs as i will want to process many states (magnitude ~10^6) many times (10^3). The problem lies in converting the matrix to a string which takes a substantial amount of time and scales poorly with state size. An example for small 25x25 states is given below:

How could i create keys in a more efficient manner? Is there another solution aside from using a map that would yield better results?

EDIT: Runnable code as requested. This example makes merges very unlikely:

worldStates = double(rand(25,25, 1000) > 0.5);

weights = rand(1,1, 1000);
ps = weights./sum(weights);

[worldStates, ps] = StateMerge(worldStates, ps);

In this example there will be lot's of merges:

worldStates = double(rand(25,25) > 0.5) .* ones(1,1,1000);
worldStates(1:2,1:2,:) = rand(2,2,1000) > 0.5;

weights = rand(1,1, 1000);
ps = weights./sum(weights);

[worldStates, ps] = StateMerge(worldStates, ps);

Get rid of the evil `eval()` function. It disables the JIT, i.e. slows down your code execution, and has a whole host of other problems associated with it. See [this answer of mine](https://stackoverflow.com/a/32467170/5211833) and references therein for more information. Could you please add a [mcve], i.e. code that *we* can run? Using `eval` and strings for otherwise purely numerical matrices doesn't sound very efficient. — Adriaan, Feb 04 '21 at 09:04
@Adriaan I added runnable example code as per your request. I know eval is considered evil but did not know how to encode the matrix effectively for the map and parse it back quickly. If you know how that is exactly what I am looking for. — Emil Jansson, Feb 04 '21 at 09:23
Also: If eval is the bottleneck, why is it not showing in the profiler? — Emil Jansson, Feb 04 '21 at 09:25
Because it shows up in a subtle way: by disabling the JIT (Just In Time, basically compilation) for the entire function. Thus the line where `eval` is might be fast, but other functions can be heavily impacted by the lack of JIT. — Adriaan, Feb 04 '21 at 09:39
I have little dictionary, map or hashing knowledge. That said, the question reads to me as if you're attempting to find duplicate pages (3rd dimension) in a 3D numerical matrix. That can be done much easier and faster by staying in numerics. — Adriaan, Feb 04 '21 at 09:42
Basically `all(A(:,:,1)==A(:,:,2),'all')` will compare two matrices for full equality (whould work for binary matrices), a simple loop over all possibilities then sufficies. Probably it can be made smarter by utilising the `'dim'` argument, or possibly smarter checking (e.g. `~any()` has the potential of terminating earlier upon non-equality than `all()`) — Adriaan, Feb 04 '21 at 09:49
The reason i went with a map was that this was the recommended solution when searching for algorithms online. Here is a example of a source recommending the use of hash tables for O(L) time complexity: https://afteracademy.com/blog/remove-duplicates-from-an-unsorted-array — Emil Jansson, Feb 04 '21 at 09:51
A loop over all possibilities as you suggest would result in time complexity O(L^2). This would be problematic as L will grow very large in my program. — Emil Jansson, Feb 04 '21 at 09:54
I will try to find an alternative to the eval function though since it could have unexpected effects on runtime as you pointed out. — Emil Jansson, Feb 04 '21 at 09:55
How big do `M` and `N` get in reality? How sparse is each `MxN` matrix (i.e. ratio of 1s to 0s)? — Wolfie, Feb 04 '21 at 10:18
@Wolfie M and N will stay under 100. Ratio of 1s to 0s will be varied 1/2 to 0. — Emil Jansson, Feb 04 '21 at 11:36
`containers.Map` is slow. If you can convert to a string that is a valid variable name (e.g. by computing a hash) then you can use a `struct`, which is a much more efficient dictionary. — Cris Luengo, Feb 04 '21 at 15:37

rahnema1 · Accepted Answer · 2021-02-05T16:32:41.093

6

Use unique to extract unique (merged) states and accumarray to sum the probabilities of the merged states. Note that this solution, like your solution, doesn't preserve the order of the original states. As suggested by @Wolfie in comments you can use unique with 'stable' option to preserve the order of the states:

function [worldStates, ps] = StateMerge(worldStates, ps)
    [M, N, L] = size (worldStates);
    worldStates1 = reshape(worldStates, M*N, L).';
    [~, uc, ui] = unique(worldStates1, 'rows');
    ps = accumarray(ui, ps(:));
    worldStates = worldStates (:, :, uc);
end

edited Feb 05 '21 at 16:32

answered Feb 04 '21 at 11:09

rahnema1

15,264
3
15
27

The solution works well with the caveat that I am receiving `ps` on the form [1,1,L] leading to an exception in accumarray. This was easily fixed with a reshape though and I was not completely clear on this in my question. – Emil Jansson Feb 04 '21 at 11:57
1

@EmilJansson I edited the answer reshaping `ps`. – rahnema1 Feb 04 '21 at 12:04
2

For reference this function is 10 times faster in the tests with L = 1000 and the positive effects seems to be even higher for larger L. – Emil Jansson Feb 04 '21 at 12:10
1

`accumarray` FTW :-) – Luis Mendo Feb 04 '21 at 17:33
1

Could you not add in some order preservation by using `unique` with the `'stable'` argument? – Wolfie Feb 05 '21 at 09:09
@Wolfie Thanks! I added your suggestion to the answer. – rahnema1 Feb 05 '21 at 16:33

Finding duplicate 2D arrays in Matlab 3D array

1 Answers1