
I am trying to perform a parallel graph traversal. The entire code has only one writable shared variable, so I would like to know why the memory consumption is so high. As I understand it, the memory usage should be roughly

  size of local variables in bfs * number of threads + some overhead

but the values I observe are much larger than this.
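
For concreteness, here is a rough back-of-the-envelope sketch of that expectation. The thread count of 8 is an assumption on my part; g_vertex = 60000 matches the program below, and the queue/stringstream overhead is ignored:

#include <cstddef>
#include <iostream>

int main() {
    const std::size_t vertices = 60000;   // g_vertex in the program below
    const std::size_t threads  = 8;       // assumed number of OpenMP threads

    // Per-thread locals of bfs(): one vector<int> for distances and one
    // (bit-packed) vector<bool> for the visited flags.
    std::size_t dist_bytes    = vertices * sizeof(int);
    std::size_t visited_bytes = vertices / 8;
    std::size_t per_thread    = dist_bytes + visited_bytes;

    std::cout << "per thread ~" << per_thread / 1024 << " KiB, "
              << "all threads ~" << (per_thread * threads) / 1024 << " KiB\n";
}

With these numbers that is roughly 240 KiB per thread, i.e. about 2 MiB of BFS locals for 8 threads.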

./d is the executable compiled using:
g++ program.cpp -I . -o d -std=c++11 -Ofast -fopenmp

Am I doing something wrong?

Compiled using gcc/9.2.0:

g++ demo.cpp -I . -o d -std=c++11 -O2 -fopenmp

#include <iostream>
#include <queue>
#include <vector>
#include <set>
#include <algorithm>
#include <climits>
#include <cmath>
#include <iomanip>
#include <map>
#include <tuple>
#include <sstream>
#include <unordered_set>

#include "omp.h"

using namespace std;


static map<int,std::vector<int>> g_adjLst;
static map<int,std::vector<int>> g_pairLst;
static int g_vertex;

void setSize(int nodesLen) ;
void addEdge(const int src, const int dst);
void print();

int findShortestPath(int src,int dst);
string bfs(int src, const std::vector<int>& dst, int perc);

int getSize() ;
int getPairSize() ;
void clearPair () ;
void storePairForPathFinding(int src, int dst) ;
string runBFS();

void setSize(int nodesLen) {
    g_vertex = nodesLen;
}

string bfs( int src, const std::vector<int>& dst ,  int perc) {

    std::stringstream msg;

    if (g_adjLst.find(src) == g_adjLst.end()) {
        return  "";
    }

    std::vector<int>  dist (g_vertex,INT_MAX);
    std::vector<bool> visited(g_vertex,false);

    std::queue <int> q;
    q.push(src);
    visited.at(src) = true;
    dist.at(src)= 0;

    while (!q.empty()) {
        int size = q.size();
        while (size--) {
            int curr = q.front();
            q.pop();
            // Reuse the iterator returned by find(): calling the non-const
            // operator[] on the shared map from several threads is not safe.
            auto adjIt = g_adjLst.find(curr);
            if (adjIt == g_adjLst.end()) {
                return "";
            }
            for (vector<int>::const_iterator it = adjIt->second.begin(); it != adjIt->second.end(); ++it) {
                if (visited.at(*it)) {continue;}

                if (dist.at(*it)> dist.at(curr) +1) {
                    dist.at(*it)= dist.at(curr) + 1;
                    q.push(*it);
                }
                visited.at(*it) = true;
            }
        }
    }

    stringstream s;
    for (std::vector<int> ::const_iterator it =  dst.begin() ; it != dst.end(); ++it) {
        s << " {" << src << "," << *it << "} " << dist[*it] ;
    }
    return s.str();
}

//
void storePairForPathFinding (int src, int dst) {
    g_pairLst[src].push_back(dst);
}

//
//
string runBFS() {

    int  i,tid=0,nthreads;

    vector<int> ::iterator ip ;
    for (std::map<int,std::vector<int>> ::iterator it = g_pairLst.begin(); it != g_pairLst.end() ;++it) {
        ip = std::unique(it->second.begin() , it->second.end());
        it->second.resize(std::distance(it->second.begin(),ip));
    }

    std::string totalStr  = "" ,partialStr = "";

    int total = g_pairLst.size();
    #pragma omp parallel for private(tid,nthreads,partialStr) shared (totalStr)
    for (i = 0 ; i < g_pairLst.size(); i++) {
        auto it = g_pairLst.begin();
        advance(it,i);
        tid = omp_get_thread_num();
        partialStr = bfs(it->first,it->second,int(i*100/total));

        // Create thread safe region.
        #pragma omp critical
        {
                //add each threads partial sum to the total sum
                totalStr += partialStr;
        }
    }
    return totalStr; 
}

//
//
void addEdge(int src, int dst) {
    g_adjLst[src].push_back(dst);
}
int main() {
    setSize(60000);

    for (int i = 0; i < 60000; i++) {
        for (int j = 0; j < 60000; j++) {
            addEdge(i, j);
            if (i == j) {
                storePairForPathFinding(0, i);
            }
        }
    }

    string s = runBFS();

    cout << "Ans1 = " << s << endl;

    string e = runBFS();
    cout << "Ans2 = " << e << endl;  
}

runBFS drives the pairs of bfs calls that need to be done - for each src and destination list (stored in g_pairLst) a bfs is run separately/in parallel. The paths found by each call are collected into a per-thread string (partialStr) and finally concatenated at the end (totalStr) - like summing up an array.

kil47
    I'll be surprised if this is what you've run into, but you owe it to yourself to read [What are the rules about using an underscore in a C++ identifier?](https://stackoverflow.com/q/228783/4581301) before you waste hours because you broke the naming rules. – user4581301 May 16 '23 at 20:36
  • In general, in a BFS you need to mark nodes so as not to travel them twice. This requires a synchronization mechanism. I see no such synchronization in your code, which is suspicious. My conclusion is that either this is not a true BFS but a basic recursive tree traversal with independent nodes, or the parallel implementation has a race condition somewhere (if so, the memory consumption is the least of your problems). – Jérôme Richard May 16 '23 at 20:57
  • @user4581301 removed the _ – kil47 May 16 '23 at 20:57
  • @JérômeRichard since the traversal variables (dist, visited and q) belong locally to the method - each thread would get a separate copy of them? Why do you think a race condition happens here? – kil47 May 16 '23 at 20:59
  • Actually, I have a hard time understanding the code. `runBFS` isn't supposed to run a BFS? It looks like it performs many independent ones. What confuses me even more is that you have only one graph, so I expect 1 BFS to be done. I do not expect `visited` to be a thread-private variable but a shared one if 1 BFS is done. Otherwise, each thread will travel the same nodes. By the way, there are no methods in this code, just functions - and in fact there are no methods in C++ at all, only "member functions" ;) – Jérôme Richard May 16 '23 at 21:04
  • @JérômeRichard -> runBFS drives the pairs of bfs that need to be done - for each src and destinationLst (stored in PAIRLST) a bfs is done separately/in parallel. Then all the paths are concatenated in a string from each thread and finally concatenated at the end - just like summing up an array. I agree - there is no method - I misspelled it in the comment – kil47 May 16 '23 at 21:08
  • Good plan. But one more recommendation: change the `ALL_CAPS` identifiers to something else. By long-standing convention, and many coding standards have codified it into divine law, `ALL_CAPS` identifiers are reserved for `#define` macros to warn other programmers that they are looking at a macro and need to worry about unintended consequences of text substitution. – user4581301 May 16 '23 at 21:10
  • @user4581301 taken care – kil47 May 16 '23 at 21:18
  • 1. Do you really need these `::iterator` loops? A simple range-based for would look far neater. 2. Having `shared (totalStr)` and then a `critical` is bad. Use a `reduction(+:totalStr)`. 3. Why do you declare every loop variable in the for header, but not the `for (i....` in the omp parallel loop? Please do that one too. 4. In fact, `private(tid,nthreads,partialStr)` - all of those should be local variables, as far as I can see. – Victor Eijkhout May 17 '23 at 00:23
  • @VictorEijkhout: reduction on string is not supported, hence the critical section. The others I can modify – kil47 May 17 '23 at 06:06
  • @kil47 OpenMP reduction works on anything that has `operator+` defined, including strings. Upgrade your compiler. It works for me. – Victor Eijkhout May 17 '23 at 09:14
  • @VictorEijkhout: I would think the same; I am using gcc/9.2.0 - any other version you recommend? Also, I don't think this would reduce the memory problem; performance might improve, but my initial problem remains the same – kil47 May 17 '23 at 09:23
  • @kil47 In general I have good luck with gcc 12, but to my surprise it doesn't accept this. I mostly use Intel compilers, and the OneAPI compiler translates this correctly. – Victor Eijkhout May 17 '23 at 09:27
  • Reduction aside - any hints on the memory usage? – kil47 May 17 '23 at 09:54
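
Regarding the reduction discussion in the comments above: a portable alternative to reduction(+:totalStr), which some compilers reject for std::string, is a user-defined (declared) reduction from OpenMP 4.0. Below is a minimal sketch; the reduction name strcat_red and the toy loop are placeholders, not the actual bfs code:

#include <iostream>
#include <string>
#include <vector>
#include "omp.h"

// User-defined OpenMP reduction that concatenates per-thread std::string results.
// Each thread starts from an empty omp_priv; its partial result is appended to omp_out.
#pragma omp declare reduction(strcat_red : std::string : omp_out += omp_in) initializer(omp_priv = std::string())

int main() {
    std::vector<int> ids = {1, 2, 3, 4, 5, 6, 7, 8};
    std::string totalStr;

    // No critical section needed: OpenMP merges the per-thread strings at the end.
    #pragma omp parallel for reduction(strcat_red : totalStr)
    for (int i = 0; i < (int)ids.size(); i++) {
        totalStr += " {" + std::to_string(ids[i]) + "}";
    }

    std::cout << totalStr << std::endl;
}

Note that the order of the per-thread chunks in the final string is not deterministic, which may or may not matter for the output format.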
