Consider a situation such as representing a sparse matrix. For example, the matrix could be 1,000,000 rows x 1,000,000 cols (or some other large size), with perhaps 50, 100, or a few thousand cells being non-zero values at any particular time.
I'm attempting to discern the best C++ data structure for representing this. The brute force and very bad answer would be (example only puts a value in 1 cell for brevity, imagine a few hundred or a few thousand cells being filled):
int numRows = 1000000;
int numCols = 1000000;
std::vector<std::vector<int>> sparseMatrix(numRows, std::vector<int>(numCols, 0));
int currentRow = 12345;
int currentCol = 98765;
sparseMatrix[currentRow][currentCol] = 10;
std::cout << "\n" << "sparseMatrix[currentRow][currentCol] = " << sparseMatrix[currentRow][currentCol] << "\n\n";
Clearly this is a disaster due to 99+% of the memory dedicated to the data structure not being used.
The next intuitive option (to me at least) was this:
std::unordered_map<std::pair<int, int>, int> sparseMatrix;
int currentRow = 12345;
int currentCol = 98765;
std::pair<int, int> rowCol = std::make_pair(currentRow, currentCol);
sparseMatrix[rowCol] = 10;
std::cout << "\n" << "sparseMatrix[rowCol] = " << sparseMatrix[rowCol] << "\n\n";
Unfortunately this fails to compile with the error:
attempting to reference a deleted function
After some Googling on this topic it seems unordered_map
is not set up to use a pair as a key.
As far as I can tell, there are 4 remaining legitimate options:
1) Use a map
, which does accept a pair of integers as a key, instead of an unordered_map
, ex (this compiles and runs):
std::map<std::pair<int, int>, int> sparseMatrix;
int currentRow = 12345;
int currentCol = 98765;
std::pair<int, int> rowCol = std::make_pair(currentRow, currentCol);
sparseMatrix[rowCol] = 10;
std::cout << "\n" << "sparseMatrix[rowCol] = " << sparseMatrix[rowCol] << "\n\n";
2) Use an unordered_map
of unordered_map
s, ex (this also compiles and runs):
std::unordered_map<int, std::unordered_map<int, int>> sparseMatrix;
int currentRow = 12345;
int currentCol = 98765;
sparseMatrix[currentRow][currentCol] = 10;
std::cout << "\n" << "sparseMatrix[currentRow][currentCol] = " << sparseMatrix[currentRow][currentCol] << "\n\n";
3) Make my own hash function for the row and col integers and feed that into a more typical std::unordered_map<int, int>
. This seems like a very bad option since if two integer pairs mapped to the same hash key that would be difficult to deal with.
4) Use boost::hash, which I gather would look something like:
std::unordered_map<std::pair<int, int>, int, boost::hash<pair<int, int>>> sparseMatrix;
I'd tend to not prefer this option b/c 1) the data structure looks very awkward, 2) I'm not sure how to do the rest of the implementation, and 3) in some cases boost may not be available.
So to clarify my questions, here they are:
1) Which option above is best for most situations? (I'd really prefer to stick to #1 or #2 if reasonably possible).
2) From what I know about map
s (red-black trees) vs unordered_map
s (hash tables) I'm under the impression that #1 would be the best on memory but #2 would be faster, is my understanding correct in this case?
3) If I'm correct on #1 being better on memory and #2 being faster, is there a clear winner in the general case I mentioned above (1,000,000 x 1,000,000 matrix with typically about 1,000 values populated) or is the difference roughly a wash?
4) How difficult would the implementation of #3 and #4 be? If #3 and/or #4 were implemented very well, would the performance benefit be enough to outweigh the coding complexity cost vs #1 or #2?
Before somebody marks this post as a duplicate, I've read this post Why can't I compile an unordered_map with a pair as key? which touches on the options above but not provide an answer to the questions I've asked here.
Before somebody says "use the built-in boot sparse matrix", yes, I'm aware that boost and some other libraries have a sparse matrix class already available. I'm still asking this question however, b/c an unordered map where the key is 2 integers my be useful in some other cases, and also some people may not have access to boost or may desire to do their own more specific implementation for a certain purpose.