1

I need to implement an algorithm in C++ that, when given three arrays of unequal sizes, produces triplets a,b,c (one element contributed by each array) such that max(a,b,c) - min(a,b,c) is minimized. The algorithm should produce a list of these triplets, in order of size of max(a,b,c)-min(a,b,c). The arrays are sorted.

I've implemented the following algorithm (note that I now use arrays of type double), however it runs excruciatingly slow (even when compiled using GCC with -03 optimization, and other combinations of optimizations). The dataset (and, therefore, each array) has potentially tens of millions of elements. Is there a faster/more efficient method? A significant speed increase is necessary to accomplish the required task in a reasonable time frame.

void findClosest(vector<double> vec1, vector<double> vec2, vector<double> vec3){

 //calculate size of each array
    int len1 = vec1.size();
    int len2 = vec2.size();
    int len3 = vec3.size();

    int i = 0; int j = 0; int k = 0; int res_i, res_j, res_k;
    int diff = INT_MAX;

    int iter = 0; int iter_bound = min(min(len1,len2),len3);
    while(iter < iter_bound)
        while(i < len1 && j < len2 && k < len3){

            int minimum = min(min(vec1[i], vec2[j]), vec3[k]);
            int maximum = max(max(vec1[i], vec2[j]), vec3[k]);

            //if new difference less than previous difference, update difference, store
            //resultants
            if(fabs(maximum - minimum) < diff){ diff = maximum-minimum; res_i = i; res_j = j; res_k = k;}

            //increment minimum value
            if(vec1[i] == minimum) ++i;
            else if(vec2[j] == minimum) ++j;
            else ++k;

    }

    //"remove" triplet
    vec1.erase(vec1.begin() + res_i);
    vec2.erase(vec2.begin() + res_j);
    vec3.erase(vec3.begin() + res_k);

    --len1; --len2; --len3;

    ++iter_bound;

}
10GeV
  • 453
  • 2
  • 14
  • 2
    This reads like it's from some contest/challenge/competitive coding/hacking site. Is it? If your goal is to learn C++, you won't learn anything there. In nearly all cases, like this one, the correct solution is based on a mathematical or a programming trick. If you don't know what the trick is and attempt to code a brute-force approach, the program runs slow and fails for that reason. If you're trying to learn C++, you won't learn anything from meaningless online contest sites [but only from a good C++ textbook](https://stackoverflow.com/questions/388242/the-definitive-c-book-guide-and-list). – Sam Varshavchik Aug 13 '20 at 19:02
  • No, this is for a personal project involving data analysis. However, I am unable to analyze the data efficiently as this runs for many, many hours. – 10GeV Aug 13 '20 at 19:02
  • 3
    Well, in addition to running slow, it is competely broken. `sizeof(arr1)`, in `findClosest()`, neither the other two `sizeof`s, actually do what you think they do. You will be surprised to learn that these return the same value no matter how big the actual arrays that get passed in. Maybe that's why it runs slow. Try checking what these `sizeof`s gives you, print their values. You'll be unpleasantly surprised. So, before you start working on optimizing, you should fix all the bugs, first. – Sam Varshavchik Aug 13 '20 at 19:04
  • If the array sizes aren't necessarily equal, what do you do with the leftover? Or (assuming you have arrays A, B, C) are you just looking for `min(len(A), len(B), len(C)` number of triplets? – scohe001 Aug 13 '20 at 19:06
  • I'm not concerned with the leftover. Also, @SamVarshavchik, this line seems to produce the correct numbers in a couple simple tests (and is also the method found in many textbooks). Is there a "better" or "correct" way? – 10GeV Aug 13 '20 at 19:07
  • can the numbers be negative? once given 3 different arrays, how many triplets do you want? – Breakpoint Aug 13 '20 at 19:09
  • There is certainly a correct way, because the above is guaranteed to be 100% incorrect. Did you take my advice, to look at what this function thinks the size of each array is? Free clue: it's not the actual size of each array, as long as each array is at least three values. Pop quiz: what does `std::cout << len1;` gives you, when you pass in an array of, oh, say a hundred values for `arr1`? I will guarantee you that it will not be 100. Like I wrote: `sizeof` does not do what you think it does, here. – Sam Varshavchik Aug 13 '20 at 19:09
  • None of the elements are negative. I'm not concerned with the number of triplets, I simply want to pair as many as possible (there will, of course, be some leftover due to the fact the triplets are of unequal size). – 10GeV Aug 13 '20 at 19:12
  • Keith You will come to fear [Undefined Behaviour](https://en.cppreference.com/w/cpp/language/ub). One of the infinite possible outcomes of UB is exactly what you expected.And then you go on stage with your boss at a trade show and [his happens](https://www.youtube.com/watch?v=73wMnU7xbwE) – user4581301 Aug 13 '20 at 19:13
  • How can you have a lot of triplets if the arrays don't have the same size (hence it's not guaranteed that `(|arr1| + |arr2| + |arr3|) % 3 == 0`)? – Daniel Aug 13 '20 at 19:13
  • 1
    Side note: You never need fastest. Fast enough is good enough and usually a lot less effort. – user4581301 Aug 13 '20 at 19:15
  • Is there a reason you don't use `std::vector` for this? – Thomas Sablik Aug 13 '20 at 19:18
  • If there is a simpler method using `std::vector`, it isn't impossible to use it. – 10GeV Aug 13 '20 at 19:19
  • The new method to get the number of elements is still wrong. Did you test it? It's impossible to get the number of elements in the function. You have to pass it some way. On way is to use `std::vector`. Another way is to pass 3 parameters containing the sizes. – Thomas Sablik Aug 13 '20 at 19:20
  • Were you typing this code in? You are mixing `min` and `max` with `minimum` and `maximum` in `f(fabs(max - min) ` and below – Vlad Feinstein Aug 13 '20 at 19:26
  • @VladFeinstein Sorry! I typed this in too quickly, I guess. This error is not in the actual code. I'm correcting that now. – 10GeV Aug 13 '20 at 19:27
  • Can an element be reused in multiple triplets? – btilly Aug 13 '20 at 19:28
  • @btilly No, each triplet must be "unique" (i.e. no two triplets may share an element), hence the "removal" step. – 10GeV Aug 13 '20 at 19:30
  • what is the relationship between the triplet elements a, b and c? – John Aug 13 '20 at 20:03

1 Answers1

1

OK, you're going to need to be clever in a few ways to make this run well.

The first thing that you need is a priority queue, which is usually implemented with a heap. With that, the algorithm in pseudocode is:

Make a priority queue for possible triples in order of max - min, then how close median is to their average.
Make a pass through all 3 arrays, putting reasonable triples for every element into the priority queue
While the priority queue is not empty:
    Pull a triple out
    If all three of the triple are not used:
        Add triple to output
        Mark the triple used
    else:
        If you can construct reasonable triplets for unused elements:
            Add them to the queue

Now for this operation to succeed, you need to efficiently find elements that are currently unused. Doing that at first is easy, just keep an array of bools where you mark off the indexes of the used values. But once a lot have been taken off, your search gets long.

The trick for that is to have a vector of bools for individual elements, a second for whether both in a pair have been used, a third for where all 4 in a quadruple have been used and so on. When you use an element just mark the individual bool, then go up the hierarchy, marking off the next level if the one you're paired with is marked off, else stopping. This additional data structure of size 2n will require an average of marking 2 bools per element used, but allows you to find the next unused index in either direction in at most O(log(n)) steps.

The resulting algorithm will be O(n log(n)).

btilly
  • 43,296
  • 3
  • 59
  • 88
  • This is quite clever! Could you clarify how one would go about implementing the second line ("make a pass through..") efficiently? – 10GeV Aug 13 '20 at 20:17
  • @KeithMadison Iterate through all 3 arrays in parallel, advancing whichever was smallest, giving you the max and min of a triple. From the third array put the entry in that is closest to the median. One pass will give you `O(n)` reasonable triples to start with. – btilly Aug 13 '20 at 20:38
  • Isn't the size of the output O(n^3) triplets? – Dave Aug 14 '20 at 14:04
  • 1
    @Dave That was my first question. But elements can't be reused, which limits the max to `O(n)`. Which is also why we need a fancy data structure to track which elements have been used already. – btilly Aug 14 '20 at 16:07