2

Consider a situation such as representing a sparse matrix. For example, the matrix could be 1,000,000 rows x 1,000,000 cols (or some other large size), with perhaps 50, 100, or a few thousand cells being non-zero values at any particular time.

I'm attempting to discern the best C++ data structure for representing this. The brute force and very bad answer would be (example only puts a value in 1 cell for brevity, imagine a few hundred or a few thousand cells being filled):

int numRows = 1000000;
int numCols = 1000000;
std::vector<std::vector<int>> sparseMatrix(numRows, std::vector<int>(numCols, 0));

int currentRow = 12345;
int currentCol = 98765;
sparseMatrix[currentRow][currentCol] = 10;

std::cout << "\n" << "sparseMatrix[currentRow][currentCol] = " << sparseMatrix[currentRow][currentCol] << "\n\n";

Clearly this is a disaster due to 99+% of the memory dedicated to the data structure not being used. (Nominally that's 1,000,000 x 1,000,000 ints, around 4 TB, of which only a few KB would ever hold real values.)

The next intuitive option (to me at least) was this:

std::unordered_map<std::pair<int, int>, int> sparseMatrix;

int currentRow = 12345;
int currentCol = 98765;

std::pair<int, int> rowCol = std::make_pair(currentRow, currentCol);

sparseMatrix[rowCol] = 10;

std::cout << "\n" << "sparseMatrix[rowCol] = " << sparseMatrix[rowCol] << "\n\n";

Unfortunately this fails to compile with the error:

attempting to reference a deleted function

After some Googling on this topic it seems unordered_map is not set up to use a pair as a key by default (there is no standard std::hash specialization for std::pair).

As far as I can tell, there are 4 remaining legitimate options:

1) Use a map, which does accept a pair of integers as a key (std::pair provides operator<), instead of an unordered_map, ex (this compiles and runs):

std::map<std::pair<int, int>, int> sparseMatrix;

int currentRow = 12345;
int currentCol = 98765;

std::pair<int, int> rowCol = std::make_pair(currentRow, currentCol);

sparseMatrix[rowCol] = 10;

std::cout << "\n" << "sparseMatrix[rowCol] = " << sparseMatrix[rowCol] << "\n\n";

2) Use an unordered_map of unordered_maps, ex (this also compiles and runs):

std::unordered_map<int, std::unordered_map<int, int>> sparseMatrix;

int currentRow = 12345;
int currentCol = 98765;

sparseMatrix[currentRow][currentCol] = 10;

std::cout << "\n" << "sparseMatrix[currentRow][currentCol] = " << sparseMatrix[currentRow][currentCol] << "\n\n";

3) Make my own hash function for the row and col integers and feed that into a more typical std::unordered_map<int, int>. This seems like a very bad option since if two integer pairs mapped to the same hash key that would be difficult to deal with.
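
For concreteness, here is a rough sketch of what I mean by #3 - I've widened the key to 64 bits on the assumption that rows and columns each fit in 32 bits (this particular packing happens to be collision-free, but a lossy hash would have the collision problem I just described):

// hypothetical sketch of #3: pack (row, col) into one 64-bit key (needs <cstdint>)
std::unordered_map<std::int64_t, int> sparseMatrix;

int currentRow = 12345;
int currentCol = 98765;

// upper 32 bits = row, lower 32 bits = col (assumes both are non-negative and fit in 32 bits)
std::int64_t key = (static_cast<std::int64_t>(currentRow) << 32) |
                   static_cast<std::uint32_t>(currentCol);

sparseMatrix[key] = 10;

std::cout << "\n" << "sparseMatrix[key] = " << sparseMatrix[key] << "\n\n";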

4) Use boost::hash, which I gather would look something like:

std::unordered_map<std::pair<int, int>, int, boost::hash<std::pair<int, int>>> sparseMatrix;

I'd tend to not prefer this option b/c 1) the data structure looks very awkward, 2) I'm not sure how to do the rest of the implementation, and 3) in some cases boost may not be available.

So to clarify my questions, here they are:

1) Which option above is best for most situations? (I'd really prefer to stick to #1 or #2 if reasonably possible).

2) From what I know about maps (red-black trees) vs unordered_maps (hash tables) I'm under the impression that #1 would be the best on memory but #2 would be faster, is my understanding correct in this case?

3) If I'm correct on #1 being better on memory and #2 being faster, is there a clear winner in the general case I mentioned above (1,000,000 x 1,000,000 matrix with typically about 1,000 values populated) or is the difference roughly a wash?

4) How difficult would the implementation of #3 and #4 be? If #3 and/or #4 were implemented very well, would the performance benefit be enough to outweigh the coding complexity cost vs #1 or #2?

Before somebody marks this post as a duplicate, I've read this post Why can't I compile an unordered_map with a pair as key? which touches on the options above but does not provide an answer to the questions I've asked here.

Before somebody says "use the built-in boost sparse matrix", yes, I'm aware that boost and some other libraries have a sparse matrix class already available. I'm still asking this question however, b/c an unordered map where the key is 2 integers may be useful in some other cases, and also some people may not have access to boost or may desire to do their own more specific implementation for a certain purpose.

cdahms
  • To the close vote person, I already Googled on those. None of the pages I found answered the questions I've stated here. Please don't vote to close a legit question. – cdahms Jul 07 '19 at 15:00
  • Did you google for "sparse matrix"? And take a look at existing algebra libraries? There should be plenty of existing techniques here. – Jens Jul 07 '19 at 15:02
  • Yes I did; as I stated in the question, I looked at boost and some other options (Eigen, etc.), but I'd like to be able to do my own implementation in case those are not available in some cases or if additional customization is needed. – cdahms Jul 07 '19 at 15:03
  • How about the classical approach to 2D matrices using a single index? `index = row * num_of_cols + col;` This will save you from having two integers as keys, and the problem will be much simpler. – Yksisarvinen Jul 07 '19 at 15:14
  • *it seems unordered_map is not set up to use a pair as a key* But nothing stops you defining suitable types so that you can use an `unordered_map` with a pair as a key. It would be a serious limitation if `unordered_map` couldn't be used with any type as a key. There are a few types it's set up with **by default** but it's easy to add more types. – john Jul 07 '19 at 15:15
  • @Yksisarvinen it seems you're alluding to my #3 option. My concern is the derived index would overflow an integer. Also, if the # of rows or columns changed during runtime, as far as I can tell this would not work. – cdahms Jul 07 '19 at 15:18
  • 1
    @cdahms See here for example, https://stackoverflow.com/questions/32685540/why-cant-i-compile-an-unordered-map-with-a-pair-as-key. That example uses a pair of strings but it's exactly the same idea for a pair of integers. – john Jul 07 '19 at 15:20
  • @john thanks for the suggestion. It seems you're hinting at my #3 option. Could you provide an answer on how to set up that custom type? – cdahms Jul 07 '19 at 15:21
  • @cdahms Sorry I didn't notice option three. Of course defining a good hash function is the thing. The boilerplate advice is to XOR the hashes of the two items in the pair (as in the example I linked). Whether that's good for you or not is obviously something only you can answer. – john Jul 07 '19 at 15:23
  • @john yes, I saw that post as well, but the author of the accepted answer states that hashing based on the "^" operator would not be effective in significant size cases. I suppose that answer with a better hash function would be my #3 option, but I'm not sure how to implement that. It would be helpful if you would provide an answer with such an implementation if you are familiar with one. – cdahms Jul 07 '19 at 15:23
  • @cdahms Good hash functions depend entirely on the details of the data you're intending to put into the hash table. So it's very hard for me to advise, since I know nothing about your data. – john Jul 07 '19 at 15:25
  • 1
    If you have 1e12 possible locations (1000000 * 1000000), it would easily fit in a 64-bit integer (`std::int64_t`). You can go as far as 9e18 (or 1.8e19 with the unsigned version). Do you plan matrices bigger than 1e9 rows and columns? – Yksisarvinen Jul 07 '19 at 15:25
  • If you are doing matrix computations then, if I recall correctly (my numerics courses were a couple of years ago ...), there are algorithms that are designed for sparse matrices. There are probably algorithms with different properties for different problems. – Jens Jul 07 '19 at 15:26
  • Should be closed as Too Broad. To meaningfully answer this question one would need to measure the performance (with realistic data sets) of the various proposed strategies and their platform implementations. This is far too much for the SO Q&A format. – Richard Critten Jul 07 '19 at 15:38
  • @Yksisarvinen, thanks for the suggestion; however, making a hash key based on the number of rows or columns would preclude changing the size of the matrix dynamically, which I would like to retain the ability to do. – cdahms Jul 07 '19 at 15:39
  • @Richard Critten I respectfully disagree. I stated an example of a 1,000,000 x 1,000,000 matrix with about 1,000 cells having non-default values; I would suggest this is more than specific enough. – cdahms Jul 07 '19 at 15:53

2 Answers

1

Clearly this is a disaster due to 99+% of the memory dedicated to the data structure not being used.

That's not clear at all. Modern OSes tend to provide the application with virtual memory that only gets backed with physical RAM when accessed, so only the memory pages you write elements into need backing RAM. If you have at most thousands of entries in your array, and each memory page is say 4k, you'd be using order-of tens of megabytes - hardly a strain on a typical modern machine. So, it's wasteful, but not necessarily problematically wasteful. It's not cache friendly - the performance implications thereof might prove the greater concern.

4) Use boost::hash, which I gather would look something like:

std::unordered_map<std::pair<int, int>, int, boost::hash<std::pair<int, int>>> sparseMatrix;

I'd tend to not prefer this option b/c 1) the data structure looks very awkward, 2) I'm not sure how to do the rest of the implementation, and 3) in some cases boost may not be available.

1) looks awkward? come on... 2) there is nothing else to do - you just use it like any other unordered_map; 3) you can create your own hash based on boost's (see this question):

#include <cstddef>     // std::size_t
#include <functional>  // std::hash
#include <utility>     // std::pair

// the same mixing step boost::hash_combine uses
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

// hasher for std::pair<int, int>, usable as the hash policy of std::unordered_map
struct hash_pair
{
    std::size_t operator()(const std::pair<int, int>& p) const
    {
        std::size_t h = 0;
        hash_combine(h, p.first);
        hash_combine(h, p.second);
        return h;
    }
};
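
A minimal usage sketch (the coordinates are just the ones from your examples):

std::unordered_map<std::pair<int, int>, int, hash_pair> sparseMatrix;

sparseMatrix[{12345, 98765}] = 10;

// missing cells are simply absent - use find() to test without inserting a zero
auto it = sparseMatrix.find({12345, 98765});
if (it != sparseMatrix.end())
    std::cout << "sparseMatrix[{12345, 98765}] = " << it->second << "\n";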

1) Which option above is best for most situations? (I'd really prefer to stick to #1 or #2 if reasonably possible).

None of your numbered options is the best for most situations: given your stated concerns about depending on boost, creating your own hasher based on boost's implementation of hash_combine is the best general solution built on Standard Library containers.

2) From what I know about maps (red-black trees) vs unordered_maps (hash tables) I'm under the impression that #1 would be the best on memory but #2 would be faster, is my understanding correct in this case?

Memory usage won't be dramatically different. GCC's hash table uses a linked list to store values, so each value requires a dynamic memory allocation with linking pointers, plus there's a contiguous array for the buckets (each being a list iterator; the array is (re)sized to maintain a reasonable load factor, so it won't be particularly big). A map makes a dynamic memory allocation per value too, but allocates a little extra for left/right pointers. Much of a muchness.

3) If I'm correct on #1 being better on memory and #2 being faster, is there a clear winner in the general case I mentioned above (1,000,000 x 1,000,000 matrix with typically about 1,000 values populated) or is the difference roughly a wash?

As mentioned, memory usage shouldn't be expected to be dramatically better for one over the other (though implementations may vary). As for faster, when there are so few values populated, just implement them both and measure. The advantages of hash tables more consistently dominate when the number of populated elements is large.
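
If it helps, a very rough timing sketch might look like the following (it reuses hash_pair from above; time_it and the container names are just placeholders, and the results will depend entirely on your data, Standard Library implementation and hardware):

// crude timing sketch - also needs <chrono>, <random>, <map>, <unordered_map>, <vector>
auto time_it = [](auto&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
};

std::map<std::pair<int, int>, int> treeMatrix;
std::unordered_map<std::pair<int, int>, int, hash_pair> hashMatrix;

// ~1,000 random cells out of the 1,000,000 x 1,000,000 grid
std::mt19937 rng(42);
std::uniform_int_distribution<int> dist(0, 999999);
std::vector<std::pair<int, int>> cells;
for (int i = 0; i < 1000; ++i)
    cells.emplace_back(dist(rng), dist(rng));

auto mapTime  = time_it([&] { for (const auto& c : cells) treeMatrix[c] = 10; });
auto hashTime = time_it([&] { for (const auto& c : cells) hashMatrix[c] = 10; });
std::cout << "map: " << mapTime << "us, unordered_map: " << hashTime << "us\n";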

4) How difficult would the implementation of #3 and #4 be? If #3 and/or #4 were implemented very well, would the performance benefit be enough to outweigh the coding complexity cost vs #1 or #2?

As mentioned, you should be comparing #1 with a rip-off of #4. Forget #3 - it's fundamentally flawed as you realised yourself "very bad option since if two integer pairs mapped to the same hash key that would be difficult to deal with".

As for coding complexity - there is virtually none. Just copy the hash implementation above, specify the hash policy when instantiating the unordered_map, and get on with using it.

If you encounter actual problems when you're implementing the options, then ask a new question to get help.

Tony Delroy
  • 1
    Creating a vector of vectors requires zeroing each element in turn, which touches each and every page. IMHO, that indeed is disastrous for performance. – Ulrich Eckhardt Jul 07 '19 at 16:28
  • @UlrichEckhardt: solid point - it would be slow if you don't specify a custom allocator to inhibit the value initialisation (which isn't beginner level code - see Note [here](https://en.cppreference.com/w/cpp/named_req/DefaultInsertable)), or wrap the `int`s in a class that has a no-op constructor. (Some virtual memory implementations avoid consuming backing memory for all-zero pages, but that wouldn't fully fix a value initialisation performance hit.) Or, `new` an array or array-of-arrays - more low-level memory coding that OP may not be comfortable with.... – Tony Delroy Jul 07 '19 at 17:18
  • I'd consider using a single vector anyways. Still, one point remains: If you want to distinguish a "used" from an "unused" element, you need to fill them with a default value. You can work around that by adding a second "index" vector with the actual data and cross-reference the two with each other which allows checking whether one is actually filled, but I don't think that's worth it. – Ulrich Eckhardt Jul 08 '19 at 08:09
0

It may or may not solve your problem, but one of your assumptions is wrong:

3) Make my own hash function for the row and col integers and feed that into a more typical std::unordered_map<int, int>. This seems like a very bad option since if two integer pairs mapped to the same hash key that would be difficult to deal with.

Dealing with hash collisions is not something you have to do but something that unordered_map does for you. Even if all values' hashes mapped to the same integer, it would correctly make sure that different values are treated as different keys, even though the performance would degrade.
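
To illustrate, here's a deliberately terrible, purely hypothetical hash (always_zero is just a made-up name) where every key collides, yet the container still keeps the keys distinct:

// worst possible hash: every key lands in the same bucket
struct always_zero
{
    std::size_t operator()(const std::pair<int, int>&) const { return 0; }
};

std::unordered_map<std::pair<int, int>, int, always_zero> m;
m[{1, 2}] = 10;
m[{3, 4}] = 20;

// distinct keys stay distinct despite the colliding hashes;
// lookups merely degrade to a linear scan of the single bucket
std::cout << m[{1, 2}] << " " << m[{3, 4}] << "\n";  // prints: 10 20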

That said, a map of maps (whether map or unordered_map) would work and provide reasonable performance, assuming you only have the few elements that you mention.

Ulrich Eckhardt