
I have a large vector (mainvect) of struct info objects (about 8 million elements), and I want to remove duplicates. The struct consists of a pid and a uid.

 struct info
 {
    int pid;
    string uid;
 };

I have another vector (vect1) which contains, for each pid, its number of occurrences in mainvect (it helps to search specific indices rather than all of mainvect). The size of vect1 is 420k elements.

 struct pidInfo
 {
    int pid;
    int numofoccurence;
 };

I want to store the unique elements of mainvect in vect2.

 .
 .
// sort mainvect based on pid
sort(mainvect.begin(), mainvect.end(), sortByPId());

int start = 0;
int end = 0;
vector <string> temp;  // to store uids with a specific pid 

for (int i = 0; i < vect1.size(); i++)
{
   end = end + vect1[i].numofoccurence;

    for (int j = start; j < end; j++)
    {
        temp.push_back(mainvect[j].uid);
    }

    start = start + vect1[i].numofoccurence;

    // remove duplicate uid 
    sort(temp.begin(), temp.end());
    temp.erase(unique(temp.begin(), temp.end()), temp.end());

    // push remaining unique uids 
    for (int k = 0; k < temp.size(); k++)
    {
        info obb;
        obb.pid = vect1[i].pid;
        obb.uid = temp[k];
        vect2.push_back(obb);
    }

    // empty the temp vector to use in next i iteration
    temp.erase(temp.begin(), temp.end());

}
 .
 .

But when I run the code, it throws an exception, as shown in the following screenshot.

userInThisWorld
  • You ran out of memory. – Justin Dec 12 '17 at 20:34
  • @Justin Is the reason that the vector size is large, or a logical error in the code? – userInThisWorld Dec 12 '17 at 20:38
  • Did you know you can replace `temp.erase(temp.begin(), temp.end());` with `temp.clear()`? Also what is the actual size of data you are trying to allocate and actual sizes of those vectors? – user7860670 Dec 12 '17 at 20:39
  • @VTT actual sizes: mainvect (8 million), vect1 (420k) ... but vect2 will be less than 8 million since duplicates will be excluded from mainvect – userInThisWorld Dec 12 '17 at 20:42
  • I mean the sizes of those vectors at the moment the exception is thrown. And which exact allocation is triggering this exception? Also, are you sure that you are populating all those strings properly? I mean, if you are copying content from some not-null-terminated buffer, the sizes of those strings can be dramatically large. – user7860670 Dec 12 '17 at 20:43
  • @VTT mainvect data is extracted from a text file, and vect1 is produced by code that counts the occurrences of each pid in mainvect ... I tried to debug, but I wasn't able to find out the sizes when the exception occurred – userInThisWorld Dec 12 '17 at 20:48
  • How is that? You should be able to navigate back up the call stack. – user7860670 Dec 12 '17 at 20:50
  • @VTT I will re-run it, but it will take a long time (more than one hour), and come back – userInThisWorld Dec 12 '17 at 20:52
  • If you are using an IDE (and it looks as if you are), you should be able to set "break on throw", which may help. – Eljay Dec 12 '17 at 22:46

3 Answers


You are running out of memory. There are a few things you can do:

  1. You are building a 32-bit program on Windows. This means you have only 2 GB of RAM available, regardless of how much physical memory your machine has. You should build your program for the 64-bit architecture to get access to all the RAM you have. To do this you need to create a new configuration for your project with the platform set to x64.
  2. You should be smarter about using your memory. The quickest thing you can do is to replace std::vector with std::deque for the large vectors.

The problem with std::vector is that every time it grows, it allocates a new memory chunk and copies all the data. The MSVC implementation you are using grows the vector by a factor of 1.5 each time. So if the vector takes 1 GB of memory, the next time it resizes it will try to allocate 1.5 GB, taking 2.5 GB of RAM in total while the resize is in progress.

Implementations of std::deque will usually allocate memory in smaller chunks, so it has fewer problems with resizing.

Another thing you should pay attention to is std::string. The MSVC implementation uses SSO (Small String Optimization). Every instance of `std::string`, AFAIR, takes 32 bytes on x86. So every element in your 8-million-element vector might or might not be wasting this memory.

Depending on how much time you want to spend on your program, you might want to learn about memory-mapped files.

  • Before recommending `std::deque`, I'd recommend seeing if you know the sizes of the vectors beforehand. If you do, then use `vector.reserve(size)` – Justin Dec 12 '17 at 21:06
  • @Justin Since OP runs his program in 32-bit, he/she is likely to run out of memory regardless of data structures. –  Dec 12 '17 at 21:09
  • True, but `std::vector` is almost always the superior container – Justin Dec 12 '17 at 21:10
  • @Justin, not really; it's the worst one if the number of elements is huge. `std::deque` won't be very good either, but I think this is the simplest solution using the standard library. –  Dec 12 '17 at 21:12
  • Even with the assumption that only 2 GB of memory is available, that leaves ~250 bytes for each of the 8,000,000 items. That should be sufficient. – user7860670 Dec 12 '17 at 21:18
  • @VTT There are other vectors too, plus there might be data we are not seeing in the code, probably lots of it. We don't know how much space those strings take - they might be large. –  Dec 12 '17 at 21:20
  • Also, when it comes to memory exhaustion, it fails the first time it can't get a **contiguous** block of memory. This may be way before the 2 GB limit. Then again, when you are starting to see memory usage this high, it becomes time to shift the algorithm to something streamable. It may be sensible to set up a small temporary database. – rioki Dec 12 '17 at 21:49

I think you actually have an algorithm problem. On each iteration you sort and keep only the unique elements in the temp vector. But with this approach, each iteration will add more and more duplicates into vect2. So you should sort and keep only unique elements in vect2 as well. Actually, it would probably be better to utilize a std::set instead of temp and vect2.

Another suggestion would be to utilize a better storage type for uid if it has some sort of fixed-length format, such as a GUID.

user7860670

As stated above, you are running out of memory. If you really have so many elements, it may be a sensible idea to look into a small database, like SQLite.

But since the question is about C++ standard containers: you are approaching the problem a bit ham-fisted, with many unnecessary sorts and loops. Besides being riddled with bugs, your algorithm does far more work than it needs to.

Why not use one of the already-sorted containers, for example std::map? You can deduplicate a list like so:

// copy into map, keyed by pid
std::map<int, info> tmp;
for (const info& i : mainvect) {
  tmp[i.pid] = i;
}

// copy back out
std::vector<info> output(tmp.size());
std::transform(tmp.begin(), tmp.end(), output.begin(),
               [](const auto& p) { return p.second; });

Not only is the code cleaner, it runs in O(n·log n). Or skip the second step and use a std::map or std::set for the data in the first place.

Also, if you handle a huge number of items, you don't want to use a std::vector. The key problem is that the memory for the vector needs to be one contiguous block. You may want to use a std::deque or a std::list.

rioki