7

I have two sparse vectors X and Y and want to get the dot product in O(m+n) where m and n are the numbers of non-zero elements in X and Y. The only way I can think of is picking each element in vector X and traverse through vector Y to find if there is element with the same index. But that would take O(m * n). I am implementing the vector as a linked list and each node has an element.

stgatilov
  • 5,333
  • 31
  • 54
Altaïr
  • 123
  • 2
  • 2
  • 9
  • 2
    Are the lists already sorted by the index? – A.S.H Sep 24 '15 at 04:33
  • @A.S.H m and n aren't the sizes of the vectors, but the number of non-zero elements in each vector – Jens Schauder Sep 24 '15 at 04:33
  • Yes i saw that after edit. But still, the answers are assuming that the lists are sorted by the index. This needs to be clearly stated I think, – A.S.H Sep 24 '15 at 04:34
  • I suggest switching from linked lists to arrays if you worry about performance. – stgatilov Sep 24 '15 at 05:42
  • @stgatilov I can't switch. It is an assignment and has to be linked lists implementation. – Altaïr Sep 24 '15 at 05:45
  • @stgatilov arrays are a very bad idea for sparse data. – Jens Schauder Sep 24 '15 at 06:15
  • @JensSchauder: You are absolutely wrong here. Look at [sparse formats in MKL](https://software.intel.com/ru-ru/node/522243#MKL_APPA_SMSF_2). If you can find any linked list there, please let me know =) Sparse data are stored in arrays to reduce memory overhead, reduce memory fragmentation, and reduce low-level time needed to iterate over elements. – stgatilov Sep 24 '15 at 06:20
  • 1
    @stgatilov ok, I see what you mean. I understood your comment to mean the naive storage in an array, which would result in lots of zeros stored. – Jens Schauder Sep 24 '15 at 06:40

3 Answers3

11

You can do it if your vectors are stored as a linked list of tuples whith each tuple containing the index and the value of a non zero element and sorted by the index.

You iterate through both vectors, by selecting the next element from the vector where you are at the lower index. If the indexes are the same you multiply the elements and store the result.

Repeat until one list reaches the end.

Since you have one step per non zero element in each list, the complexity is O(m+n) as required.

Footnote: The datastructure doesn't have to be linked list, but must provide a O(1) way to access the next non 0 element and it's index.

Jens Schauder
  • 77,657
  • 34
  • 181
  • 348
  • @A.S.H I tried to explain it in my edit but it seems rather obvious to me that it is O(m+n) – Jens Schauder Sep 24 '15 at 04:32
  • But what if I have a vector with more elements than the other ? For instance X = [(15,3.0), (1500,3.14) , (15000, 3.141), (150000, 3.1415)] and Y = [(15,1.0), (150000,1.0)] I will run out of elements in Y while iterating through X – Altaïr Sep 24 '15 at 04:36
  • @Jens, i agree, but you are assuming that the vector lists are sorted by the index. This is not clearly stated by the OP :) – A.S.H Sep 24 '15 at 04:37
  • I can't see how this is O(m+N). How is it gonna be done on vectors with different number of non-zero elements ? You can't iterate through both at the sametimes – Altaïr Sep 24 '15 at 05:21
  • @A.S.H you are right about the sorting, no idea why your edit was rejected. – Jens Schauder Sep 24 '15 at 06:18
  • @JensSchauder no worries, I suggest you edit it yourself, so that it becomes complete. – A.S.H Sep 24 '15 at 06:31
  • 1
    @Altaïr you only advance the list where you are currently at the lower index, until you reach the end of one of the lists. – Jens Schauder Sep 24 '15 at 06:42
0

Sorted lists

Given that your nonzero elements are sorted by coordinate index in both vectors, it is achieved by merge algorithm. That is a standard algorithm in computer science, which merges two sorted sequences into one sorted sequence, and it works in O(M + N).

There are two ways to do it. The first one is to check for equal elements inside merge. And it is indeed the best way.

The second way is to merge first, then check for equals (they must be consecutive then):

std::pair<int, double> vecA[n], vecB[m], vecBoth[n+m];
std::merge(vecA, vecA+n, vecB, vecB+m, vecBoth);
double dotP = 0.0;
for (int i = 0; i+1 < n+m; i++)
  if (vecBoth[i].first == vecBoth[i+1].first)
    dotP += vecBoth[i].second * vecBoth[i+1].second;

Complexity of std::merge is O(M + N).

Example above assumes that the data is stored in arrays (which is the best choice for sparse vectors and matrices). If you want to use linked lists, you can also perform merge in O(M + N) time, see this question.

Unsorted lists

Even if your lists are unsorted, you can still perform dot product in O(M + N) time. The idea is to put all the elements of A into hash table first, then iterate through elements of B and see if there is an elements in hash with same index.

If indices are very large (e.g. more than million), then perhaps you should really use a nontrivial hash function. However, if your indices are rather small, then you can avoid using hash function. Simply use array of size greater than dimension of your vectors. In order to clear this array fast, you can use the trick with "generations".

//global data! must be threadlocal in case of concurrent access
double elemsTable[1<<20];
int whenUsed[1<<20] = {0};
int usedGeneration = 0;

double CalcDotProduct(std::pair<int, double> vecA[n], vecB[m]) {
  usedGeneration++;   //clear used array in O(1)
  for (int i = 0; i < n; i++) {
    elemsTable[vecA[i].first] = vecA[i].second;
    whenUsed[vecA[i].first] = usedGeneration;
  }
  double dotP = 0.0;
  for (int i = 0; i < m; i++)
    if (whenUsed[vecB[i].first] == usedGeneration)
      dotP += elemsTable[vecB[i].first] * vecB[i].second;
  return dotP;
}

Note that you might need to clear whenUsed once per billion dot products.

Community
  • 1
  • 1
stgatilov
  • 5,333
  • 31
  • 54
  • We still didn't cover yet merge algorithms in class. one person before you suggested that I iterate through both vectors at the same time and check if the indices are equal. if they are multiply the values and store the result. But I don't how this will work. What if i have two vectors with different number of elements. For instance X = [(15,3.0), (1500,3.14) , (15000, 3.141), (150000, 3.1415)] and Y = [(15,1.0), (150000,1.0)] I will run out of elements in Y while iterating through X and I will null pointer exception. Isn't that right ? – Altaïr Sep 24 '15 at 05:49
  • @Altaïr: Since merge is a well-known algorithm, you can find explanation yourself. Here is a [similar question](http://stackoverflow.com/questions/11039217/time-complexity-for-merging-two-sorted-arrays-of-size-n-and-m), here is [algolist article](http://www.algolist.net/Algorithms/Merge/Sorted_arrays), here is [some question](http://www.cplusplus.com/forum/general/73264/), etc. You can see some code [here](http://stackoverflow.com/a/18142545/556899) in context of merge sort, it shows what to do when one sequence ends first. P.S. Added solution for unsorted case. – stgatilov Sep 24 '15 at 06:05
0
 Use Map to store each vector.
 Each entry of map has index as key and value as the vector value at the particular index. Insert only the non zero values 
 Iterate on one map and for each entry check whether the particular key is present in the other map.If yes update the product else ignore the current key
 Time Complexity :  n -> vector size
                   O(n) - for map construction 
                   O(n) - for iteration 
 Space Complexity :  O(n) - for maps
lokecodes
  • 1
  • 1
  • 1