22

I'm working with some binary data that I have stored in arbitrarily long arrays of unsigned ints. I've found that I have some duplication of data, and am looking to ignore duplicates in the short term and remove whatever bug is causing them in the long term.

I'm looking at inserting each dataset into a map before storing it, but only if it was not found in the map to start with. My initial thought was to have a map of strings and use memcpy as a hammer to force the ints into a character array, and then copy that into a string and store the string. This failed because a good deal of my data contains multiple bytes of 0 (aka NULL) at the front of the relevant data, so a majority of very real data got thrown out.

My next attempt is planned to be std::map<std::vector<unsigned char>,int>, but I'm realizing that I don't know if the map insert function will work.

Is this doable, even if ill advised, or is there a better way to approach this problem?

Edit

So it's been remarked that I didn't make clear what I'm doing, so here's a hopefully better description.

I'm working on generating a minimum spanning tree, given that I have a number of trees containing the actual end nodes I'm working with. The goal is to come up with the selection of trees that has the shortest length and that covers all of the end nodes, where the chosen trees share at most one node with each other and are all connected. I'm basing my approach off of a binary decision tree, but making a few changes to hopefully allow for greater parallelism.

Rather than taking the binary tree approach I've opted to make a bit vector out of unsigned integers for each dataset, where a 1 in a bit position indicates the inclusion of the corresponding tree.

For example if just tree 0 were included in a 5 tree dataset I would start with

00001

From here I can generate:

00011

00101

01001

10001

Each of these can then be processed in parallel, since none of them depend on each other. I do this for all of the single trees (00010, 00100, etc..) and should, I haven't taken the time to formally prove it, be able to generate all values in the range (0,2^n) once and only once.

I started to notice that many datasets were taking far longer to complete than I thought they should, and enabled a debugging output to look at all of the generated results, and a quick Perl script later it was confirmed that I had multiple processes generating the same output. Since then I've been trying to resolve where the duplicates are coming from with very little success, and I'm hoping that this will work well enough to let me verify the results that are being generated without the, sometimes, 3 day wait on computations.

jthecie
  • 395
  • 1
  • 2
  • 7

4 Answers4

20

You will not have problems with that, as std::vector provides you the "==", "<" and ">" operators:

http://en.cppreference.com/w/cpp/container/vector/operator_cmp

Renan Greinert
  • 3,376
  • 3
  • 19
  • 29
  • 1
    But you need a less-than operator to be used as a key in map. I guess you could provide the comparison as a template argument however. – Brian Neal Jan 18 '12 at 00:56
  • 1
    vector also provides that, as shown in the link. I will edit my answer to make it cleaner. Thanks for the observation. – Renan Greinert Jan 18 '12 at 00:58
12

The requirements for being a key in std::map are satisfied by std::vector, so yes you can do that. Sounds like a good temporary solution (easy to code, minimum of hassle) -- but you know what they say: "there is nothing more permanent than the temporary".

Community
  • 1
  • 1
Jon
  • 428,835
  • 81
  • 738
  • 806
  • What I'm really hoping to see is if I still get the duplicate results after adding this in. That would narrow down the search to I'm unintentionally storing or fetching duplicates rather than generating them. – jthecie Jan 18 '12 at 21:34
6

That should work, as Renan Greinert points out, vector<> meets the requirements to be used as a map key.

You also say:

I'm looking at inserting each dataset into a map before storing it, but only if it was not found in the map to start with.

That's usually not what you want to do, as that would involve doing a find() on the map, and if not found, then doing an insert() operation. Those two operations would essentially have to do a find twice. It is better just to try and insert the items into the map. If the key is already there, the operation will fail by definition. So your code would look like this:

#include <vector>
#include <map>
#include <utility>

// typedefs help a lot to shorten the verbose C++ code
typedef std::map<std::vector<unsigned char>, int> MyMapType;

std::vector<unsigned char> v = ...; // initialize this somehow
std::pair<MyMapType::iterator, bool> result = myMap.insert(std::make_pair(v, 42));
if (result.second)
{
   // the insertion worked and result.first points to the newly 
   // inserted pair
}
else
{
   // the insertion failed and result.first points to the pair that
   // was already in the map
}
Brian Neal
  • 31,821
  • 7
  • 55
  • 59
  • I've updated the original question with more detail on what I'm doing, and hopefully that helps clear up my motivation for this. I honestly didn't realize that the find could be skipped like this, thanks for the awesome new thing I can use in STL. – jthecie Jan 18 '12 at 22:09
  • Oops, I just now fixed the comment for the "key was already in the map case". If the key was already in the map, then `result.second` will be false, and `result.first` will point to the existing key & value pair. – Brian Neal Jan 19 '12 at 00:52
0

Why do you need a std::map for that? Maybe I miss some point but what about using an std::vector together with the find algorithm as examplained here?

This means, that you append your unsigned ints to the vector and later search for it, e.g.

std::vector<unsigned int> collector; // vector that is substituting your std::map
for(unsigned int i=0; i<myInts.size(); ++i) {  // myInts are the long ints you have
    if(find(collector.begin(), collector.end(), myInts.at(i)==collector.end()) {
         collector.push_back(myInts.at(i));
    }
}
ezdazuzena
  • 6,120
  • 6
  • 41
  • 71
  • 1
    That works fine if I can fit everything into one unsigned int. The problem is that once I get a sufficiently large input I have to start spilling over into multiple integers, and have to preserve them as the whole collection for the searching. I was really hoping to do this without actually making a struct to deal with this as it's hopefully a very temporary hack. – jthecie Jan 18 '12 at 22:03