1

For the following typical case:

n = 1000000;
r = randi(n,n,2);

(assume there are 0.05% common numbers between all rows; n could be even tens of millions)

I am looking for a CPU and Memory efficient solution to merge rows based on any common items (here integer numbers). A list of sample codes in Python is available here and a quick try to translate one into Matlab can be found here.

In my attempt they take ages (minutes to hours), so I am in favor of finding faster solution.

For the above example, the typical output should look like (cell):

{
[1 90 34 67 ... 9]
[35 89]
[45000 23 828 130 8999 45326 ... 11]
...
}

Note also that, I have tried to compile as mex but failed due to no-support for cell in Matlab-Coder.

Edit: A tiny demonstration example

%---------------------------------------
clc
n = 100;
r = randi(n,n,2);        % random integers in [1,n], size(n,2)
%---------------------------------------
>> r
r =
    82    17             % (1) 82 17
    91    13             % (2) 91 13
    13    32             % (3) 91 13 32            merged with (2), common 13
    82    53             % (4) 82 17 53            merged with (1), common 82
    64    17             % (5) 82 17 53 64         merged with (4), common 17
    ...
    94    45
    13    31             % (77) 91 13 32 31        merged with (3), common 13
    57    51
    47    52
     2    13             % (80) 91 13 32 31 2      merged with (77), common 13
    34    80
%---------------------------------------
c = merge(r);            % cpu and memory friendly solution is searched for.
%---------------------------------------
c =
    [82 17 53 64]
    [91 13 32 31 2]
    ...
Community
  • 1
  • 1
Developer
  • 8,258
  • 8
  • 49
  • 58
  • You need to define what "merge" means here. Or give the exact input that produces that example output – Luis Mendo Mar 23 '17 at 11:33
  • @LuisMendo In the newly added demonstration, for the given data the product is shown. So the question becomes much clear. Pay attention to the comments in the demonstration code. – Developer Mar 23 '17 at 12:39
  • I see. Interesting question, but tricky – Luis Mendo Mar 23 '17 at 12:41
  • 1
    You're searching for connected components in an undirected graph. If you have 2015b or later, you can try [graph.conncomp](https://www.mathworks.com/help/matlab/ref/graph.conncomp.html), but I don't know how fast it is. – beaker Mar 23 '17 at 17:17
  • @beaker I see, unfortunately it is a lower version of Matlab and it seems there is no `graph` toolbox or `conncomp` function available to me. It therefore would be great if someone could contribute to answer with a pure Matlab implementation of `conncomp` so no need for any other third party or other licenses. – Developer Mar 25 '17 at 14:25
  • @Developer Fortunately, you can do this with a simple BFS. You can find several implementations on File Exchange or write your own. I'll try to have a look when I get a bit more time. – beaker Mar 27 '17 at 13:57
  • Here are a couple of approaches: http://stackoverflow.com/questions/16883367/how-to-find-connected-components-in-matlab – beaker Mar 27 '17 at 14:35

1 Answers1

0

You need an index.

In Python, use a dict. In MATLAB - I'd not use MATLAB, because open-source is the future, and MATLAB is dying out.

But Python is quite slow. You can likely get a 10x speedup by using e.g. Cython to translate and optimize the code in C. Avoid using Python data types such as a list of int, because they are very memory intensive. numpy has memory-efficient arrays of integer.

If you get a new pair (a,b) you can use this dictionary to find existing items to merge. Then update the dict after the merge. Actually for integers, you should use an array instead of a dict.

The trickiest part is handling the case when both a and b exist, but are large different groups. There are some neat optimizations possible here, if that isn't fast enough yet.

It's not clustering, but connected components.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
  • You're probably correct about the open source part. However, this is a question asking for hints or guidance or experience in implementing a solution for the described question in Matlab. I am curious to see a pure Matlab or Python (so I can translate it into Matlab) implementation of connected components. – Developer Mar 25 '17 at 14:29
  • I probably have to dig in through `scipy.sparse.csgraph.connected_components` to see if there is simple implementation out there. – Developer Mar 25 '17 at 14:33
  • Simple tends to contradict "CPU and Memory efficient". – Has QUIT--Anony-Mousse Mar 25 '17 at 21:08