Suggestions for Sparse Vector implementation without using outside libraries

Question

I have to use a sparse vector for some things in my current project. But, since I am not in charge of the project, I can not use whatever outside libraries I want. I only have STL and OpenCV available.

I have looked through several stackoverflow answered questions, but they either focus on a specific approach, comparison of a limited number of approaches (2) and outside libraries when they deal specifically with sparse vectors. There is also some excellent ideas for implementing a sparse matrix.

What I want is specifically a sparse vector (the indexes will always be in 1 dimension, the data is not relevant for this question). I would like something that will not be a project on it's own to implement, but could be used for more-than-demonstrative-purposes (e.g. I want to achieve a decent speed and not too much of a memory overhead) and hopefully re-used later. The options I have considered included:

adapting the OpenCV implementation of SparseMat to my needs
using a std::map to store the values (or maybe making a very simple wrapper that would return a default value in case of indexing a zero-element)
using a std::vector< std::pair < int , data_type > > where I could store the index and the data in the std::pair elements

Is any of these solutions better/worse for the general purpose use as a sparse vector? I know every approach to everything has it's ups and downs, but argumented suggestions on which approach to choose would be very much appreciated. Also, recommending an approach I have not considered would be more than welcome if anyone thinks he has a better suggestion.

The usage in my specific case is as follows:

the vector will most likely not be modified after it is created (right now I don't see any need for this, but I can not guarantee 100% that it will not come up)
a most common operation is anticipated to be dot product of two such vectors (so, more-or-less accessing the elements in a linear sequential way)
only lookup I can foresee right now is (maybe) checking weather a certain element is a zero-element or not
there will be around 500 non-zero elements expected
in short, most of the time the sparse vector would be used as the mathematical concept of a vector (a multi-dimensional point) without much need to examine each coordinate separately

Still, as I have written in my original question, I would like suggestions for a general-purpose Sparse Vector implementation.

How much data you will be storing? How will the data be accessed in a linear or random way? Do you need to insert/delete in the middle? Must the memory be contiguous? Do you need to find data? — RedX, May 29 '12 at 10:14
At 500 elements i don't see the need for anything fancy unless profiling tells you so. Do as Matthieu and shabaz say and go a map and see how that works for you. — RedX, May 29 '12 at 10:45
@RedX As this Sparse Vectors will be the most used datatype in my whole implementation, I started thinking about the most general good approach. Also, the number of elements is around 500 for the current implementation, and I wanted to check if there is a scalable approach that would have good performance even if the number of non-zero elements grows. In any case, the point was it all got me thinking about a most appropriate implementation, so I was mostly interested in general-purpose approach to implementing this efficiently. — penelope, May 29 '12 at 11:55

Shahbaz · Accepted Answer · 2012-05-29T12:08:15.900

I believe std::map would give you the best results. SpareseMat, I don't know, but among the two other methods you mentioned, std::map will give you O(log(n)) lookup as well as O(log(n)) insertion and deletion. The vector however requires a search over all its data (so it has O(n) lookup). It has O(1) insertion, but O(n) removal. My guess is that you will have a lot of lookup, so most probably std::map is better for you.

Depending on your application, you might want to use the vector method during initial creation of the structure and then convert it to map once you start using it to get the best of both worlds (but often this is not the case, for example in the cases where you have repeating indices).

Besides hash which should give you O(1) everything, but in reality might not, the lookup of O(log(n)) is the best you can hope for. You may come up with a vector that you can binary search on, or any other method based on searching for the data by comparison, but in the end they would all be O(log(n)) so you might as well use the easy-already-done std::map.

Update: Based on the update on your question, that indicates that the vector will most likely not be modified after it is created and a most common operation is anticipated to be dot product, I would suggest the following:

First, use a vector of pairs as you have suggested yourself. During creation, simply push_back and you get O(1) performance.¹ After that, you can sort the vector. The dot product would be quite simple²:

int dot = 0;
unsigned int index_v1 = 0, index_v2 = 0;
while (index_v1 < v1.size() && index_v2 < v2.size())
    if (v1[index_v1].first == v2[index_v2].first)
        dot += v1[index_v1++].second * v2[index_v2++].second;
    else if (v1[index_v1].first < v2[index_v2].first)
        ++index_v1;
    else
        ++index_v2;

Checking whether a certain element is a zero-element or not would be a simple binary search, checking whether the element could be found or not (O(log(n)) performance).

Given the fact that you would be using this structure as a point, I believe keeping it a vector would be better. You might want to later do cross product or other geometric operations.

Regarding the fact that you may need to insert something in the vector every now and then, then you would have to insert it in place (so the vector remains sorted). The performance would be O(n), but since it doesn't happen often, it shouldn't be an issue.

¹ Unless you have millions of these vectors, O(1) and O(log(n)) for n ~= 500 should not really make any noticeable difference.

² You could definitely also use a map and use iterators over it to do the dot product in order of index. The performance will be the same if std::map uses a threaded tree that lets you get to the next node in O(1).

@MatthieuM. if you want to use binary search obviously, but depending on the application, it may not be a good idea, since inserting in the middle of the vector is costly. That was why I suggested using a vector during initialization (so you just `push_back`) and then later convert it to `map` for fast lookup (or sort the vector) — Shahbaz, May 29 '12 at 11:03
By side-links, I think you mean a [threaded tree](https://en.wikipedia.org/wiki/Threaded_binary_tree). — Fred Foo, May 29 '12 at 11:37
@larsmans, that's exactly what I meant. I couldn't remember how they named it — Shahbaz, May 29 '12 at 12:06

score 3 · Answer 2 · answered May 29 '12 at 10:20

KISS

Start with the simplest solution, that is wrap std::map<size_t, YourData> into a specific sparse vector structure:

template <typename V>
class SparseVector {
public:

private:
    std::map<size_t, V> specificValues;
    V defaultValue;
};

The map will give good default performances in all use cases (find/insert/update) and the use of a custom class means that if you ever need to change to another implementation, you will not have to change the clients sources, just recompile.

Suggestions for Sparse Vector implementation without using outside libraries

2 Answers2