1

So, I have a weird problem. I've got a vector of checksums, essentially, sorted by time, and I'm looking to identify any checksum that is between two matching checksums. In other words, if we had:

std::vector < std::string > checksums = {"foo","bar","baz","foo"};

I'd want it to report FALSE, TRUE, TRUE, FALSE, in order.

How the heck do I efficiently go about doing this, given that each checksum needs to be checked against every /other/ checksum? The hashtable example here, for identifying duplicate elements, is potentially one part of the puzzle, but it sort of does the opposite of what I'm looking for.

Community
  • 1
  • 1
Oliver Keyes
  • 3,294
  • 2
  • 14
  • 23
  • The naive algorithm for this would have quadratic complexity. Do you think that's too much? – eerorika Dec 27 '14 at 00:02
  • 2
    Use the hashtable to find duplicate elements, and put the vector index as the value in the table. Then when you detect a duplicate, you can put `TRUE` in the result for all the indices in between. – Barmar Dec 27 '14 at 00:02
  • @Barmar sounds workable; I suspected hashtables would be the answer but I'm largely (sadly) ignorant of them. Can you provide an example? – Oliver Keyes Dec 27 '14 at 00:04
  • Why is the second output element TRUE? It seems "bar" is _not_ between "foo" and "baz"? – Nicu Stiurca Dec 27 '14 at 00:06
  • Er. It's not about whether it's between "foo" and "baz". It's TRUE because it's between the two "foo" sums. – Oliver Keyes Dec 27 '14 at 00:07
  • If some checksum occurs more than 2x... is everything between the first an last occurrence TRUE, or only everything between 1st and 2nd, _not_ between 2nd and 3rd, between 3rd and 4th, etc.? – Nicu Stiurca Dec 27 '14 at 00:12
  • The latter - exactly the reason this is headache-inducing for me! Sorry I didn't make that clear. – Oliver Keyes Dec 27 '14 at 00:15
  • Just to make sure: you don't know up front, if a checksum is part of a matching pair or not - correct? Also: Are the ranges possibly overlapping? – MikeMB Dec 27 '14 at 00:33
  • No and no, respectively. Sam's answer appears to work with a few fiddles. – Oliver Keyes Dec 27 '14 at 00:34
  • One other thing, if you are concerned about runtime: Do you really have to store thr checksum as a string (instead of e.g. a std::array?) – MikeMB Dec 27 '14 at 00:36
  • They come in as a string, so I'd have to split each one if I wanted to take that approach. – Oliver Keyes Dec 27 '14 at 00:49

1 Answers1

3

I'll outline a basic approach:

std::map<std::string, size_t> position;

for (size_t i=0; i<checksums.size(); ++i)
{
     std::map<std::string, int>::iterator prev=position.find(checksums[i]);

     if (prev == position.end())
     {
         position.insert(std::make_pair(checksums[i], i));
     }
     else
     {
         size_t j=prev->first;
         positions.erase(prev);

         // Everything between index #j and index #i
     }
}

I haven't tested if this compiles, if not, it's a minor typo somewhere.

You did not specify how ambiguous situations are handled. I.e., a list of {"foo", "bar", "baz", "foo", "foobar", "foo"} -- whether you're looking for the entries just before the 1st and the 2nd foo, or the 1st and the 3rd foo. My proposed solution, of course, assumes the former, but it wouldn't be too hard to adjust it for the latter.

Also, some further refinements on the algorithms are also possible. For example, it's possible to combine the find/compare/insert operation into a single operation. How to do that, is going to be your homework assignment. Hint: look at the return value of std::map::insert().

EDIT: this finds the matching ranges only. Turning the result into, essentially, a vector of false/true values is not very complicated. You can just initialize the vector to all false values, and when you find the matching range, set every element after j and before i to true.

Sam Varshavchik
  • 114,536
  • 5
  • 94
  • 148
  • 1
    FWIW I'd be looking for things between foo1 and foo2, and foo3 and foo4, but not foo2 and foo3. I'll work on this basis and see where it takes me; it's certainly a lot better as a starting point than my existing mix of confusion and...no, just confusion. – Oliver Keyes Dec 27 '14 at 00:20
  • Consider replacing map with unordered map, as it is often faster. – MikeMB Dec 27 '14 at 00:28
  • It's also >=C++11 only, and a restriction of my platform is that C++11 is still fairly new. But a good suggestion, and one I'll investigate when 11 usage is more integrated :) – Oliver Keyes Dec 27 '14 at 00:33
  • The code sample I posted should not need any changes to find foo1-foo2 and foo3-foo4. – Sam Varshavchik Dec 27 '14 at 01:39