more efficient structure as unordered_map, int>

Question

I have about 20,000,000 pair<int, int> which I need to associate to ints. I did so with an unordered_map<pair<int, int>, int>. Profiling my algorithm shows that checking whether an entry exists or not

bool exists = myMap[make_pair(a, b)] != NULL

is the performance bottleneck. I thought that retrieving this information from an unordered_map would be really fast, as it is O(1). But constant time can be slow if the constant is big...

My hash-function is

template <>
struct tr1::hash<pair<int, int> > {
public:
        size_t operator()(pair<int, int> x) const throw() {
             size_t h = x.first * 1 + x.second * 100000;
             return h;
        }
};

Do you know any better data-structure for my problem?

Obviously I can't just store the information in a matrix, hence the amount of memory wouldn't fit into any computer in existence. All I know about the distribution is that myMap[make_pair(a, a)] doesn't exist for any a. And that all ints are in a continuous range from 0 to about 20,000,000.

Think of it as a sparse 20,000,000x20,000,000-Matrix with about 20,000,000 entries but never on the main diagonal.

Suggestion

Would a vector<pair<int, int>>* (array with N entries) expected to be faster? The lookup for a would be trivial (just the index of the array) and then I would iterate through the vector, comparing the first value of the pair to b.

BIG UPDATE

I uploaded the raw data so you can see the structure.

Must the ints in the pair be retrievable? Or is it fine to create two maps? `umap` and `umap>`? — RedX, Jul 11 '14 at 07:48
Note that `myMap[make_pair(a, b)] != NULL` does not do what you think it does. It inserts the pair if it doesn't exist, and compares the mapped value to `0` (which is what `NULL` expands to). It does not check for existence at all. In modern C++, you should never use `NULL`. — Angew is no longer proud of SO, Jul 11 '14 at 07:49
We need to know your algorithm to find a better data structure. Maybe you just need an two-dimensional array, but maybe you need something else. — johnchen902, Jul 11 '14 at 07:56
If you know anything about the data distribution, you could provide your own hash function, adapted to produce as little collisions as possible. — Angew is no longer proud of SO, Jul 11 '14 at 07:58
The keys are created dynamically. There's just not enough memory to create a `umap` for all possible combinations of `some_key`. — user2033412, Jul 11 '14 at 08:07
Since maps are usually implemented as binary search trees I wouldn't say that checking if a key exists is O(1) but rather O(log N). Anyway it would be helpful if you describe the purpose of your whole structure (i.e. what and why you want to do this exactly) and (as Blastfurnace already mentioned) provide us with the hash function you use. — a_guest, Jul 11 '14 at 08:13
@a_guest `std::unordered_map` is a hashmap, not a tree. That's `std::map`. — Angew is no longer proud of SO, Jul 11 '14 at 08:14
Yes, it's a 64-bit platform. It's a 20,000,000x20,000,000-matrix and I have to check about 50,000 entries. Most of them will be null and only in the few cases they aren't I'm interested in the value. — user2033412, Jul 11 '14 at 08:23
So then you should just save the entries which are different from zero. This will greatly reduce the size of your map. — a_guest, Jul 11 '14 at 08:24
@a_guest: That's what I do by using a map and checking whether an entry exists or not. — user2033412, Jul 11 '14 at 08:27
Ok then you could use two nested binary trees, the first one for the row index and the second one for the column index. Your own implementation or maybe use std::map. — a_guest, Jul 11 '14 at 08:41
Maybe you should show your algorithm. Sometimes you just need a different algorithm. I once got an TLE when trying to use sparse matrix like this as part of my algorithm and the solution is a hand-crafted two-dimensional dynamic-allocated segment tree. — johnchen902, Jul 11 '14 at 08:53
I doubt you are. What you need is something self-made which fits your needs (like the suggested nested binary search tree). — a_guest, Jul 11 '14 at 09:05
I would recommend to look for a library which can deal with sparse matrices. All these problems have been solved nicely already. A nice overview seems to be here: http://www.netlib.org/utk/people/JackDongarra/la-sw.html — Peter - Reinstate Monica, Jul 11 '14 at 09:08
You could try using unordered_map::bucket_size to see if hash collisions are a problem. — Falk Hüffner, Jul 11 '14 at 09:24

score 5 · Answer 1 · answered Jul 11 '14 at 07:48

5

Have you tried using myMap.find(make_pair(a,b)) != myMap.end() ? operator[] creates the element if it does not exist. I would expect find to be faster.

answered Jul 11 '14 at 07:48

Baptiste Wicht

7,472
7
45
110

I just tried `find` vs. `[]` and it changed nothing. :-( – user2033412 Jul 11 '14 at 07:53
you just made my day! thanks for this! I had been trying to this question on codechef for 2 days and this was the optimizing i needed! thanks a lot!! – otaku Oct 08 '14 at 12:37

Angew is no longer proud of SO · Answer 2 · 2014-07-11T09:30:53.443

3

First off, myMap[make_pair(a, b)] != NULL does not do what you think it does. It inserts the pair if it doesn't exist, and compares the mapped value to 0 (which is what NULL expands to). It does not check for existence at all. (Note that in modern C++, you should never use NULL. Use 0 for numbers and nullptr for pointers).

As for the main topic, your hash function doesn't seem too good. Don't forget that arithmetic on ints is done in ints. Since on most compilers int is 32-bit, its maximum value is a little over 2,000,000,000. So 20,000,000 * 10,000 is way bigger than that, leading to overflow (and undefined behaviour).

Given the number of your data, I assume you're on a 64-bit platform, which means size_t is 64 bits long. So you might get better results with a hash function like this:

size_t operator()(pair<int, int> x) const throw() {
     size_t f = x.first, s = x.second;
     return f << (CHAR_BIT * sizeof(size_t) / 2) | s;
}

This should produce significantly less collisions (and have defined behaviour) that what you have now.

If this doesn't help, you could also try a two-step approach:

std::unordered_map<int, std::unordered_map<int, int>>

Lookup by x.first first, then by x.second. I don't know if this would help; measure and see.

edited Jul 11 '14 at 09:30

answered Jul 11 '14 at 08:21

Angew is no longer proud of SO

167,307
17
350
455

By using your hash-function the algorithm becase unusably slow... like... more than 100 times slower! I'm really surprised! – user2033412 Jul 11 '14 at 08:31
2

@user2033412 Interesting. Have you tried profiling inside the `find()` of the map to find why it takes so long (even in your original version)? As a sanity check, you *are* running optimised code, right? – Angew is no longer proud of SO Jul 11 '14 at 08:38
I _am_ running optimized code. Profiling shows the following during insertion (creating of the umap is also incredibly slow...): [profiler](http://s14.directupload.net/images/140711/693ii8jl.png) – user2033412 Jul 11 '14 at 08:47
4

sizeof(size_t) / 2 will be something like 4. You probably want CHAR_BIT * sizeof(size_t) / 2. – Falk Hüffner Jul 11 '14 at 09:22
I just tried the corrected version of Falk Huffner. Now it runs as fast as the original version. This version should not produce any collisions at all and hence be as performant as possible, right? – user2033412 Jul 11 '14 at 09:26
@user2033412 To the best of my knowledge, yes. – Angew is no longer proud of SO Jul 11 '14 at 09:31
@user2033412 Your question says that the bottleneck is lookup, but the profiler result is for insertion - so which is it? – Angew is no longer proud of SO Jul 11 '14 at 09:32
I'm only interested in lookup. I showed you insertion because it was also really slow and probably had the same problem: collisions. Just ignore it; you both fixed it! :-) – user2033412 Jul 11 '14 at 09:36

score 2 · Answer 3 · answered Jul 11 '14 at 10:08

2

Main thing is definitely to avoid adding default-constructed elements with every search:

bool exists = myMap[make_pair(a, b)] != NULL; // OUCH

bool exists = myMap.find(make_pair(a, b)) != myMap.end();  // BETTER

iterator i = myMap.find(make_pair(a, b);
if (i != myMap.end()) ... else ...;      // MAY BE BEST - SEE BELOW

And the great hash challenge... woo hoo! This might be worth a shot, but a lot depends on how the numbers in the pairs are distributed and your implementation's std::hash (which is often pass-through!):

    size_t operator()(pair<int, int> x) const throw() {
         size_t hf = std::hash(x.first);
         return (hf << 2) ^ (hf >> 2) ^ std::hash(x.second);
    }

You may also find it faster if you replace the pair with int64_ts, so that the key comparisons are definitely simple integer comparisons rather than cascaded.

Also, what are you doing after the test for existence? If you need to access/change the value associated with the same key then you should save the iterator find returns and avoid another search.

answered Jul 11 '14 at 10:08

Tony Delroy

102,968
15
177
252

I'm now using Angew's hash-function which is basically comparing `int64_t` (see his answer below). I need to access the value but as most looked up pairs don't exists this is not critical. – user2033412 Jul 11 '14 at 10:19
@user2033412: I have reason to think my hash function may work better than Angew's (or I wouldn't have bothered to post it after seeing his) - it gets technical but in the end if you care about speed you should measure both then choose. When I talk about 64 bit `unordered_map` keys above, I am not talking about the width of the hash, but rather combining the two `int`s in the key into a single `int64_t` before storing it in the table, then extracting it again afterwards, or storing a type with a `union` of `int64_t` and `int32_t`. – Tony Delroy Jul 11 '14 at 10:54
1

Separately, you may be able to use some kind of bloom filter to eliminate many of the failing lookups without searching the hash table. Is there anything more you tell me about the data? What kind of histogram would you get for the number of repeats of any given `.first` value? Clearly the average is one repeat, but what's the shape of the curve? Is the curve for `.second` similar? Is `.first` completely uncorrelated to `.second`? – Tony Delroy Jul 11 '14 at 10:59
Im currently creating a text-file containing the complete set of raw-data in the following format (one entry per line): [a,b] -> c. I'll upload it and post the link asap. – user2033412 Jul 11 '14 at 11:10
Could you take a look at it, @tony-d? – user2033412 Jul 15 '14 at 16:31
@user2033412: I had a look on the weekend... nothing struck me about your dataset as lending itself to more efficient hashing or indexing... the only unexpected find was that in the key pairs the same values are often seen in .first and .second (~300k values only appeared in one or the other, but 12 million in both), there's actually about 36m values (all from memory), and a wide variety of repetitions. I didn't try plotting it as an x/y distribution to see if there were any pattern there, but that's a long shot. Summarily, I'm stuck.... – Tony Delroy Jul 16 '14 at 00:38

score 1 · Accepted Answer · answered Jul 20 '14 at 12:42

1

As suggestet, I went with a vector<pair<int, int>>* with N entries. It's about 40% faster than the unordered_map.

answered Jul 20 '14 at 12:42

user2033412

1,950
2
27
47

score 0 · Answer 5 · answered Jul 11 '14 at 08:32

0

I suggest you test with a better hash function. You can find examples if you search here on SO but this is one possible implementation.

struct pair_hash {
    template <typename T1, typename T2>
    size_t operator()(const std::pair<T1, T2> &pr) const {
        using std::hash;
        return hash<T1>()(pr.first) ^ hash<T2>()(pr.second);
    }
};

answered Jul 11 '14 at 08:32

Blastfurnace

18,411
56
55
70

1

This will give collisions on the main diagonal of the matrix (i.e. (0,0), (1,1), ...) where usually the elements are considered to be non-zero. – a_guest Jul 11 '14 at 08:35
1

@a_guest From OP: *All I know about the distribution is that `myMap[make_pair(a, a)]` doesn't exist for any `a`.* – johnchen902 Jul 11 '14 at 08:37
@a_guest: True, you could always shift or complement one of the hash values to reduce that problem. – Blastfurnace Jul 11 '14 at 08:37
I just tried it and it's slightly slower than the original hash-function. – user2033412 Jul 11 '14 at 08:38

more efficient structure as unordered_map, int>

Suggestion

BIG UPDATE

5 Answers5