Array of set-of-integers in MATLAB

Question

I have n different unique unsigned 32-bit integers. For each integer in for-loop I will generate a random index (address) and I want to put that number is a set specified by that index (address). For example in for-loop I reach to number 7 and my code produces 13 as address, then I need to add the number 7 to the 13th set. Having index (address) range of 1 to k, I will need k different sets. Currently I am using "cell" data structure in MATLAB.

array_of_sets=cell(n,1);

and when I want to add new member to ith set, I will index by array_of_sets{i} and then I will concatenate my new number.

My problem is that this approach is not memory efficient nor time efficient. Can anyone please guide me to a more efficient way to do this.

This is a simplified version my code so far:

array_of_sets=cell(k,1);
for i=1:n 
   address=something_genrated_randomly;
   array_of_sets{address}=[array_of_sets{address},uint32(i)]; %Add to the corresponding set(One specific Cell)
end

Output: Given the index ind, outputs the the ind^th set of integers.

Basically what I am looking for is similar to ArrayList<Set<Integer>> from Java but in MATLAB.

Can you provide some code to demonstrate how you are currently storing your data and the nature of your data? — Suever, Jan 21 '16 at 20:56
Please don't add relevant code in comments: edit your question and include every necessary information in order to create a [mcve]. — Andras Deak -- Слава Україні, Jan 21 '16 at 21:13
No problem. What do you mean by `Table{address+(Lind-1)*(2^k)}uint32(i)`? That doesn't seem to be valid MATLAB. — Andras Deak -- Слава Україні, Jan 21 '16 at 21:16
I see the white space is omitted. Now it's fixed! @AndrasDeak — Sadegh Riazi, Jan 21 '16 at 21:22
Indeed. Now, could you say a few words about what you are actually trying to achieve? It's far from clear, and your solution seems very complicated (while your problem might not be). — Andras Deak -- Слава Україні, Jan 21 '16 at 21:31
I was serious about saying a few "words". But I think I get a general idea: you want to assign the values in `1:n` to a bunch of random indices, and if there are multiple hits of the same index, then you want to append the new values to the previous ones. Well, if you want to have vectors of different size with each index (`address`), then you can only use `cell`s. But they really are slow. One other option is to use an array, but then you must zero-pad it to have a rectangular shape. If the array is very sparse, you can use a sparse matrix. — Andras Deak -- Слава Україні, Jan 21 '16 at 23:29
@AndrasDeak Thanks for your help. Yeah, I think you got what is my problem but both solutions seem not very efficient. — Sadegh Riazi, Jan 22 '16 at 17:34
@Sadegh It might be just me, but I still don't find your question clear:) Here's what I think of when I say "well-formed question": http://stackoverflow.com/questions/34726725/rendering-matlab-patch-faces-with-plotly-fig2plotly . Anyway, your intentions are probably clear together with your comments, so people might be able to answer regardless (and I don't want to bug you, you just seemed to appreciate feedback). — Andras Deak -- Слава Україні, Jan 22 '16 at 19:39
@Sadegh you may want to say some words about the actual problem. My point is that you should specify indata and wanted output. You do not really have to bother so much about suggesting the "wrong" output type. In case you, for example, would propose using a `cell` and there are better alternatives, this will most likely be stated in a comment. — patrik, Jan 25 '16 at 07:37

patrik · Answer 1 · 2016-01-26T06:50:21.220

Cells requires some extra memory to be able to store the information you need. This seems to be 112 bytes for a full cell independently matrix size and data type. For an empty cell 8 bytes is allocated, which is the same as a 64-bit pointer (one can assume this makes the header of each cell 112-8=104 bytes). This means that you will experience a huge waste of memory in case your arrays are short. In case you can guarantee that all the cells are smaller than 28 elements (112/4), you will get away cheaper with a 3-dimensional vector for every situation (zero-padded). However in case some cells are overly represented you will probably save memory by using cells. Also, as others have stated, cells are slow as well. So that may be something to take into account if execution time is a problem. Further in case the RAM is getting full you will start swapping and this will slow execution tremendously. If this is the problem, getting the memory down should be the main problem.

a = cell(1,2);
a{1} = zeros(100);
b=zeros(100);
c = cell(1,2);
d = cell(10,10);
e = cell(10,10);
d(:,:) = {zeros(10)};
e(:,:) = {zeros(10,10,10)};
f = zeros(100,1000);
whos
Name        Size               Bytes  Class     Attributes

a           1x2                80120  cell                
b         100x100              80000  double              
c           1x2                   16  cell                
d          10x10               91200  cell                
e          10x10              811200  cell                
f         100x1000            800000  double

EDIT

One way to create a non repeating set of integers is this way when using a zero padded array. Assume that you want 100 sets of about 1000 values each. Also in case you assume the distribution of random value is uniform they will probably be fairly evenly distributed. Assume a variation of an average of 10 values per set. This means that 10*number_of_sets values will be zero.

maxVal = 100000;
nSets = 100;
nonAssigned = 1000;
sets = randperm(maxVal);
sets(sets > (maxVal-nonAssigned)) = 0;
sets = reshape(vec,100,1000);

This will set all the "empty" values to zero and in this case we have about 1% memory overhead. This is still better than using an HashSet or TreeSet taking up 32*SIZE bits. In case this 1% is too much you will probably have to iterate and generate a few of the sets iteratively. In case you need this amount of data I do not think you have many options. In case this causes memory trouble you may want to look into if writing the code in c and mex the function will solve your problems.

Thanks @patrik. Yes... the problem is that it seems there is no similar ArrayList> of Java in MATLAB!! — Sadegh Riazi, Jan 22 '16 at 17:41
@Sadegh I am not aware of any type like that in Matlab. A `cell` is probably as good as it gets. What we **can** provide is an analysis of the memory consumption of the `cell` type and suggest alternatives to when `cell`s is more efficient or when other data types can be used instead to save memory. Regarding having unique entries in each cell is kind of implicitly said, but not clearly stated. Further I am not really sure the Matlab code you have does that job. Apart from that I assume that calculation time is a problem since nothing is stated about that. — patrik, Jan 25 '16 at 07:30
Both the calculation time and memory usage is a problem for me. But my current code actually does the job only the performance is an issue for me. — Sadegh Riazi, Jan 25 '16 at 18:53
@Sadegh I am also quite amazed that you have selected ArrayList> as a memory friendly data model. As far as I know the most common sets, HashSet and TreeSet are not really memory friendly. — patrik, Jan 26 '16 at 06:32

Geoff · Answer 2 · 2016-01-25T21:04:18.267

I think your question boils down to this:

Question: How can I randomly assign a set of n integers into k subsets efficiently?

If that's indeed what you're asking, here's a simple solution that only requires storage of 2*n values in memory, avoiding cell and zero-padded arrays, and uses logical indexing for speed.

Solution: Assign a vector of random "addresses", then use logical indexing whenever you need to refer to a specific set:

%// for example:
n = 10;
k = 4;

%// 1-by-n vector of integers, for example:
ints = 1:n;

%// 1-by-n vector of random "addresses" between 1 and k
address = randi([1,k],[1,n]);

Now you can access any subset using logical indexing:

set1 = ints(address==1);
set2 = ints(address==2);
...
setk = ints(address==k);

Memory-wise, you'll need to store 2*n values: the integers and their addresses. If the subsets are all approximately the same size, then using a zero-padded array may be cheaper.

Example:

>> [ints;address]

ans =

 1     2     3     4     5     6     7     8     9    10
 4     4     1     4     3     1     2     3     4     4

>> set1=ints(address==1)
>> set2=ints(address==2)
>> set3=ints(address==3)
>> set4=ints(address==4)

set1 =

 3     6


set2 =

 7


set3 =

 5     8


set4 =

 1     2     4     9    10

You could also use sortrows to re-order the integers and addresses:

>> sortrows([ints;address]',2)'

ans =

 3     6     7     5     8     1     2     4     9    10
 1     1     2     3     3     4     4     4     4     4

Array of set-of-integers in MATLAB

2 Answers2