
I've been interested in optimizing "renumbering" algorithms that relabel an arbitrary array of integers (with duplicates) using labels starting from 1. Sets and maps are too slow for what I've been trying to do, as are sorts. Is there a data structure that reliably remembers only whether a number has been seen before? I was considering experimenting with a Bloom filter, but I have >12M integers and the target performance is faster than a good hashmap. Is this possible?

Here's a simple example pseudo-C++ algorithm that would be slow:

// note: all elements guaranteed > 0
std::vector<uint64_t> arr = { 21942198, 91292, 21942198, ... millions more };

std::unordered_map<uint64_t, uint64_t> renumber;
renumber.reserve(arr.size());

uint64_t next_label = 1;
for (uint64_t i = 0; i < arr.size(); i++) {
    uint64_t elem = arr[i];
    if (renumber[elem]) {
        arr[i] = renumber[elem];
    }
    else {
        renumber[elem] = next_label;
        arr[i] = next_label;
        ++next_label;
    }
}

Example input/output:

{ 12, 38, 1201, 120, 12, 39, 320, 1, 1  }
->
{ 1, 2, 3, 4, 1, 5, 6, 7, 7 }
SapphireSun
  • Could you provide some (pseudo-)code of what you're trying to do? It's not very clear. – Aykhan Hagverdili May 21 '22 at 04:48
  • Sure let me update that. – SapphireSun May 21 '22 at 04:50
  • Is there an upper limit to the magnitudes of the numbers? – Aykhan Hagverdili May 21 '22 at 04:58
  • Nope... it could go from 1 to the maximum of a uint64_t... – SapphireSun May 21 '22 at 04:58
  • How about their count? Is it limited to roughly 12 million numbers? – Aykhan Hagverdili May 21 '22 at 04:59
  • Yes, the count is limited to some tens of millions (12 million I have seen). – SapphireSun May 21 '22 at 05:00
  • `std::unordered_map renumber;` isn't a valid declaration. It's a map between uint64_t and what? – selbie May 21 '22 at 05:00
  • Sorry about that, it's just trying to relabel the array from 1 ... N. The idea is that after the array is renumbered, I can then use techniques that use arrays instead of hash maps. I fixed the declaration. – SapphireSun May 21 '22 at 05:01
  • `if (renumber[elem])` will do an insert into the map of a blank element if it's not already present. That's going to be a huge performance drain on your code. Use `renumber.find(elem) != renumber.end()` instead – selbie May 21 '22 at 05:02
  • Thanks selbie, the idea here though (that's just to explain what I want to do) is to avoid any use of sets or maps at all while having a linear(ish) time algorithm that's faster than the fastest set/map algorithm. I was thinking that maybe there's a data structure out that that only remembers whether it's seen a number before and is fast. That would be sufficient to know when to increment label. – SapphireSun May 21 '22 at 05:04
  • If the max number of elements is known, that means a hash table can be tuned to perform very well with [open addressing](https://en.wikipedia.org/wiki/Open_addressing). And considering this is the classic use case for a hash-table, I doubt you can do better than that. – Aykhan Hagverdili May 21 '22 at 05:07
  • @selbie in fact you should store the found element to reuse later to avoid another search: `auto iter = renumber.find(elem); if (iter == renumber.end()) { /* insert */ } else { ... = *iter; }` – phuclv May 21 '22 at 05:07
  • If I understand the problem correctly, you have an array of integers. You want to scan each number and replace each duplicate with an integer label. So if your input set was `{100, 200, 300, 100, 200, 300}`, then you'd want the array to become `{100,200,300,1,2,3}`. If that's the case, how do you distinguish between a label and an original occurrence of a number? – selbie May 21 '22 at 05:28
  • Sorry, maybe I am tired and I wrote the code wrong, but the idea is to simply replace all labels with a smaller integer. The "originals" are remapped as well as the duplicates. The problem probably can even be formulated even more simply: What's the fastest way to extract unique integers from an array? A mapping can be assigned very trivially once that's done. Thank you for clarifying! – SapphireSun May 21 '22 at 05:30
  • The only data structure I know of that would be theoretically more efficient than an `unordered_map` would be a bit-vector (e.g. `std::bitset<N>`, where N is at least one greater than the maximum `uint64_t` value that might be in your array). You could set the nth bit to indicate that you've encountered the value n. The problem with that is that unless your array's numbers are all relatively small, it would require an insane amount of memory to hold that many bits; enough that `unordered_set` would probably be faster if only because it would be more likely to fit in RAM. – Jeremy Friesner May 21 '22 at 05:48
  • I've read your post 3 times and all the comments and it's still not clear to me what the algorithm is. Can you give some examples of input/output? – bolov May 21 '22 at 05:54
  • I started to post an answer, but I found a bug. I'll repost shortly. – selbie May 21 '22 at 05:56
  • @JeremyFriesner He needs 2 things: 1) know when a number was already seen, 2) a number to map the number to. – Goswin von Brederlow May 21 '22 at 09:44
  • @bolov Think perfect hash function with no collisions and no holes. – Goswin von Brederlow May 21 '22 at 09:46
  • Got it. The example finally makes sense to me. Thank you for updating – bolov May 21 '22 at 17:13

2 Answers


Your algorithm is not bad, but the appropriate data structure to use for the map is a hash table with open addressing.

As explained in this answer, std::unordered_map can't be implemented that way: https://stackoverflow.com/a/31113618/5483526

So if the STL container is really too slow for you, then you can do better by making your own.

Note, however, that:

  1. 90% of the time, when someone complains about STL containers being too slow, they are running a debug build with optimizations turned off. Make sure you are running a release build compiled with optimizations on. Running your code on 12M integers should take a few milliseconds at most.
  2. You are accessing the map twice per element when a single lookup suffices. The loop can be rewritten like this:
uint64_t next_label = 1;
for (size_t i = 0; i < arr.size(); i++) {
    uint64_t elem = arr[i];
    uint64_t &label = renumber[elem];
    if (!label) {
       label = next_label++;
    }
    arr[i] = label;
}

Note that `unordered_map`'s `operator[]` returns a reference to the associated value (creating it if it doesn't exist), so you can test and modify the value without having to search the map again.

Matt Timmermans
  • Thanks Matt! I'm also experimenting with absl::flat_hash_map, which uses open addressing. https://abseil.io/docs/cpp/guides/container My question is mainly whether there's a data structure that can reliably determine whether a number has been "seen" without lots of out-of-cache memory jumps and without much computation (so it's faster overall). – SapphireSun May 22 '22 at 17:33
  • Not really. You will always have to store all the numbers that have been seen, and you will always have to check each one you see against a unique indication in memory of whether or not it's been seen. You may initially divide them into buckets that can be checked with a table that fits into L2 cache, but I don't know if that would help overall. – Matt Timmermans May 22 '22 at 20:00

Updated with bug fix

First, anytime you experience "slowness" with a std:: collection class like vector or map, just recompile with optimizations (release build). There is usually a 10x speedup.

Now to your problem. I'll show a two-pass solution that runs in O(N) time. I'll leave it as an exercise for you to convert to a one-pass solution. But I'll assert that this should be fast enough, even for vectors with millions of items.

First, declare not one, but two unordered maps:

std::unordered_map<uint64_t, uint64_t> element_to_label;
std::unordered_map<uint64_t, std::pair<uint64_t, std::vector<uint64_t>>> label_to_elements;

The first map, element_to_label, maps an integer value found in the original array to its unique label.

The second map, label_to_elements, maps each label to both the element value and the list of indices at which that element occurs in the original array.

Now to build these maps:

element_to_label.reserve(arr.size());
label_to_elements.reserve(arr.size());

uint64_t next_label = 1;
for (size_t index = 0; index < arr.size(); index++)
{
    const uint64_t elem = arr[index];
    auto itor = element_to_label.find(elem);
    if (itor == element_to_label.end())
    {
        // new element
        element_to_label[elem] = next_label;

        auto &p = label_to_elements[next_label];
        p.first = elem;
        p.second.push_back(index);
        next_label++;
    }
    else
    {
        // existing element
        uint64_t label = itor->second;
        label_to_elements[label].second.push_back(index);
    }
}

When the above code runs, it has built up a database of all the values in the array, their labels, and the indices where they occur.

So now to renumber the array such that all elements are replaced with their smaller label value:

for (auto itor = label_to_elements.begin(); itor != label_to_elements.end(); itor++)
{
    uint64_t label = itor->first;
    auto& p = itor->second;
    uint64_t elem = p.first; // technically, this isn't needed. It's just useful to know which element value we are replacing from the original array
    const auto& vec = p.second;

    for (size_t j = 0; j < vec.size(); j++)
    {
        size_t index = vec[j];
        arr[index] = label;
    }
}

Notice where I assign variables by reference with the & operator to avoid making an expensive copy of any value in the maps.

So if your original vector or array was this:

{ 100001, 2000002, 300003, 400004, 400004, 300003, 2000002, 100001 };

Then the application of labels would render the array as this: {1,2,3,4,4,3,2,1}

And what's nice is that you still have a quick O(1) lookup to map any label in that set back to its original element value using label_to_elements.
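The one-pass version left as an exercise above could be sketched as follows, assuming you only need the relabeled array. Note this single-map approach drops the label-to-elements reverse mapping that the two-pass version provides:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// One-pass renumbering: each element is replaced by its label in place.
// operator[] value-initializes missing entries to 0, which is a safe
// "not seen yet" sentinel here because labels start at 1.
void renumber_one_pass(std::vector<uint64_t>& arr) {
    std::unordered_map<uint64_t, uint64_t> element_to_label;
    element_to_label.reserve(arr.size());
    uint64_t next_label = 1;
    for (uint64_t& elem : arr) {
        uint64_t& label = element_to_label[elem];
        if (label == 0) label = next_label++;  // first occurrence: new label
        elem = label;
    }
}
```

On the example array above, `{ 100001, 2000002, 300003, 400004, 400004, 300003, 2000002, 100001 }`, this yields the same `{1,2,3,4,4,3,2,1}` result as the two-pass version.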

selbie
  • Thank you selbie, this is essentially the idea for the current algorithm I have, but I'm seeing if I can make it even faster by finding a way to make it more efficient than even the abseil flat_hashmap. The reason is that there is a large scale research computation and shaving seconds off can result in materially shorter run times. I'm not sure this is possible, but I asked the question because I've run into this problem before and feel there must be some way... That said, I haven't tried optimizing the hash policy. – SapphireSun May 21 '22 at 06:16
  • First run of optimization: Benchmark and measure your performance using profiling tools. You're assuming the implementation of hash table is the root cause of a performance issue. I'll ponder this: Maybe that's not the hash table implementation, but something else in your code. You'd be surprised what profilers can turn up these days. – selbie May 21 '22 at 06:20
  • It's also really hard to beat the performance of hashing an integer. One thing you could try is to mess around with the max_load_factor and bucket sizes of the std::unordered_map class. Also, don't forget to compile with max optimizations on. – selbie May 21 '22 at 06:32
  • The labels are limited to 12 million so `uint32_t` would be enough and should be faster. – Goswin von Brederlow May 21 '22 at 09:49
  • How could this code possibly be faster than the OP's implementation in the question? – Matt Timmermans May 21 '22 at 14:27