2

The context and the problem below are only examples that can help to visualize the question.

Context: Let's say that I'm continously generating random binary vectors G with length 1x64 (whose values are either 0 or 1).

Problem: I don't want to check vectors that I've already checked, so I want to create a kind of table that can identify what vectors are already generated before.

So, how can I identify each vector in an optimized way?

My first idea was to convert the binary vectors into decimal numbers. Due to the maximum length of the vectors, I would need 2^64 = 1.8447e+19 numbers to encode them. That's huge, so I need an alternative.

I thought about using hexadecimal coding. In that case, if I'm not wrong, I would need nchoosek(16+16-1,16) = 300540195 elements, which is also huge.

So, there are better alternatives? For example, a kind of hash function that can identify that vectors without repeating values?

Community
  • 1
  • 1
Víctor Martínez
  • 854
  • 2
  • 11
  • 29
  • 2
    Are you going to actually generate `2^64` different vectors? If it's fewer than `2^32`, I'd just encode them as `uint64` – Suever Jan 10 '17 at 23:12
  • 1
    The only hash table I know of that's implemented in MATLAB is `containers.Map`. I don't know how efficient a map will be in this instance, but it's worth a try. Trying to create even a bitset of 2^64 bits is probably overly memory-intensive. – beaker Jan 10 '17 at 23:16
  • @Suever I'mt not performing an exhaustive search, so it is quite improbable that I'll reach the `2^32`. Thanks for the sugerence! – Víctor Martínez Jan 11 '17 at 08:37
  • @beaker Effectively, `containers.Map` is what I wanted! – Víctor Martínez Jan 11 '17 at 10:35

1 Answers1

1

So you have 64 bit values (or vectors) and you need a data structure in order to efficiently check if a new value is already existing?

Hash sets or binary trees come to mind, depending on if ordering is important or not.

Matlab has a hash table in containers.Map.

Here is a example:

tic;
n = 1e5; % number of random elements
keys = uint64(rand(n, 1) * 2^64); % random uint64

% check and add key if not already existing (using a containers.Map)
map = containers.Map('KeyType', 'uint64', 'ValueType', 'logical');
for i = 1 : n
    key = keys(i);
    if ~isKey(map, key)
        map(key) = true;
    end
end
toc;

However, depending on why you really need that and when you really need to check, the Matlab function unique might also be something for you.

Just throwing out duplicates once at the end like:

tic;
unique_keys = unique(keys);
toc;

is in this example 300 times faster than checking every time.

Community
  • 1
  • 1
NoDataDumpNoContribution
  • 10,591
  • 9
  • 64
  • 104
  • 1
    Awesome answer! `containers.Map` adapts perfectly to what I wanted. In addition, it allows me to set `double` values asigned to each key :) – Víctor Martínez Jan 11 '17 at 10:34