
I am trying to find the minimal distance between points (x, y) in the Manhattan metric. I have been searching for information about this, but I haven't found anything.

#include<bits/stdc++.h>
using namespace std;
#define st first
#define nd second

pair<int, int> pointsA[1000001];
pair<int, int> pointsB[1000001];

int main() {
    int n, t;
    unsigned long long dist;

    scanf("%d", &t);

    while(t-->0) {
        dist = 4000000000LL;
        scanf("%d", &n);

        for(int i = 0; i < n; i++) {
            scanf("%d%d", &pointsA[i].st, &pointsA[i].nd);
        }

        for(int i = 0; i < n; i++) {
            scanf("%d%d", &pointsB[i].st, &pointsB[i].nd);
        }

        for(int i = 0; i < n; i++) {
            for(int j = 0; j < n; j++) {
                if(abs(pointsA[i].st - pointsB[j].st) + abs(pointsA[i].nd - pointsB[j].nd) < dist) {
                    dist = abs(pointsA[i].st - pointsB[j].st) + abs(pointsA[i].nd - pointsB[j].nd);
                }
            }
        }
        printf("%llu\n", dist);
    }
}

My code runs in O(n^2), but it is too slow. I do not know whether it will be useful, but y in pointsA is always > 0 and y in pointsB is always < 0. My code compares each distance to the current minimum and keeps the smallest.

For example:

Input:

2
3
-2 2
1 3
3 1
0 -1
-1 -2
1 -2
1
1 1
-1 -1

Output:

5
4
Handus
  • First of all, use proper indentation please. –  Aug 27 '16 at 17:17
  • Also, [don't include `bits/stdc++.h`](http://stackoverflow.com/questions/31816095/why-should-i-not-include-bits-stdc-h). – sjrowlinson Aug 27 '16 at 17:23
  • I see that you have also discovered the [goes to zero operator](http://stackoverflow.com/questions/1642028/what-is-the-name-of-the-operator-in-c) in `while(t-->0)`. – Bo Persson Aug 27 '16 at 17:24
  • Why do you not cache the result of the Manhattan distance calculation? – fghj Aug 27 '16 at 17:24
  • I don't know how to do this. – Handus Aug 27 '16 at 17:34
  • @Handus I mean you calculate the distance twice, and since you already use variables you know how to cache the result. – fghj Aug 27 '16 at 17:38
  • Hm, I don't calculate the distance twice. In my task pointsA != pointsB, and I must find the smallest distance between pointsA_i and pointsB_i. – Handus Aug 27 '16 at 17:43
  • @Handus You have `abs` with the same arguments; maybe the compiler optimizes this, maybe not. If you care about performance, and for simplification, you should save the result to a variable. – fghj Aug 27 '16 at 17:47
  • It won't be trivial to lower the complexity of this below O(n^2). There are certain optimizations you can make in its current form, though, such as not calculating the distance twice, or e.g. not calculating y if the delta x is already > the current minimum distance. You could also skip the point if the current pointsA y is > min dist or current pointsB y is < -min dist, because of the assumptions you state. But to reduce it past O(n^2), hmm... maybe you could pre-load the points into a quad-tree and cleverly limit the search to each cell and its neighbors. – Jason C Aug 27 '16 at 18:48
  • https://en.wikipedia.org/wiki/Closest_pair_of_points_problem – dvvrd Aug 27 '16 at 19:14
  • @dvvrd I am not convinced those algorithms hold up for non-Euclidean distances. Are you? – Jason C Aug 28 '16 at 15:20
  • @user1034749 Actually, fwiw, I tested just now, and for reasons I don't understand storing the distance result in a variable adds a whopping 20% to the total run time vs. the OP's approach in all cases (tested with 40000 points in each set, gcc w/ `-O2`: with variable = 10.5 seconds, with "two" calculations = 8.2 seconds). Somebody smarter than me can probably explain that. In any case it's not exactly like that second calculation happens that often, it only happens when a closer point is found. – Jason C Aug 28 '16 at 16:25
  • @JasonC I think the main reason is `int32_t` in the points and `uint64_t` as the result. If you, for example, implement things the right way and promote the difference to `int64_t`, like this: `abs(int64_t(p1.x) - p2.x) + abs(int64_t(p1.y) - p2.y)`, the code becomes slower by 40%-50%. So I think in the caching case you have to compare `int32_t` promoted to `int64_t` against `uint64_t`, versus comparing `uint64_t` against `uint64_t`. And maybe the `gcc` optimization passes handle such cases in different ways. – fghj Aug 28 '16 at 16:32
  • @user1034749 That totally makes sense. However, in my own tests I used 32-bit `int` all around. It's very unexpected. – Jason C Aug 28 '16 at 16:38

4 Answers


My solution (note that for simplicity I do not care about overflow in manhattan_dist, and for that reason it does not use unsigned long long):

#include <cstdlib>
#include <cstdio>
#include <cassert>
#include <vector>
#include <limits>
#include <algorithm>

typedef std::pair<int, int> Point;
typedef std::vector<std::pair<int, int> > PointsList;

static inline bool cmp_by_x(const Point &a, const Point &b)
{
    if (a.first < b.first) {
        return true;
    } else if (a.first > b.first) {
        return false;
    } else {
        return a.second < b.second;
    }
}

static inline bool cmp_by_y(const Point &a, const Point &b)
{
    if (a.second < b.second) {
        return true;
    } else if (a.second > b.second) {
        return false;
    } else {
        return a.first < b.first;
    }
}

static inline unsigned manhattan_dist(const Point &a, const Point &b)
{
    return std::abs(a.first - b.first) +
        std::abs(a.second - b.second);
}

int main()
{
    unsigned int n_iter = 0;
    if (scanf("%u", &n_iter) != 1) {
        std::abort();
    }
    for (unsigned i = 0; i < n_iter; ++i) {
        unsigned int N = 0;
        if (scanf("%u", &N) != 1) {
            std::abort();
        }
        if (N == 0) {
            continue;
        }
        PointsList pointsA(N);
        for (PointsList::iterator it = pointsA.begin(), endi = pointsA.end(); it != endi; ++it) {
            if (scanf("%d%d", &it->first, &it->second) != 2) {
                std::abort();
            }
            assert(it->second > 0);
        }
        PointsList pointsB(N);
        for (PointsList::iterator it = pointsB.begin(), endi = pointsB.end(); it != endi; ++it) {
            if (scanf("%d%d", &it->first, &it->second) != 2) {
                std::abort();
            }
            assert(it->second < 0);
        }

        // Sort both sets by y: pointsA.front() is the A point closest to the
        // x-axis from above, pointsB.back() is the B point closest from below.
        std::sort(pointsA.begin(), pointsA.end(), cmp_by_y);
        std::sort(pointsB.begin(), pointsB.end(), cmp_by_y);
        const PointsList::const_iterator min_a_by_y = pointsA.begin();
        const PointsList::const_iterator max_b_by_y = (pointsB.rbegin() + 1).base();
        assert(*max_b_by_y == pointsB.back());

        // The distance between this closest-in-y pair is the initial upper bound;
        // its y gap is the smallest y gap any A/B pair can have (best_diff_y below).
        unsigned dist = manhattan_dist(*min_a_by_y, *max_b_by_y);
        const unsigned diff_x = std::abs(min_a_by_y->first - max_b_by_y->first);
        const unsigned best_diff_y = dist - diff_x;

        // Cut-off y values: a point farther than dist (in y alone) from the
        // other set's closest point cannot improve the bound.
        const int max_y_for_a = max_b_by_y->second + dist;
        const int min_y_for_b = min_a_by_y->second - dist;
        // Drop A points with y above max_y_for_a and B points with y below min_y_for_b.
        PointsList::iterator it;
        for (it = pointsA.begin() + 1; it != pointsA.end() && it->second <= max_y_for_a; ++it) {
        }
        if (it != pointsA.end()) {
            pointsA.erase(it, pointsA.end());
        }

        PointsList::reverse_iterator rit;
        for (rit = pointsB.rbegin() + 1; rit != pointsB.rend() && rit->second >= min_y_for_b; ++rit) {
        }
        if (rit != pointsB.rend()) {
            pointsB.erase(pointsB.begin(), (rit + 1).base());
        }
        // Re-sort the surviving points by x for the range searches below.
        std::sort(pointsA.begin(), pointsA.end(), cmp_by_x);
        std::sort(pointsB.begin(), pointsB.end(), cmp_by_x);

        for (size_t j = 0; diff_x > 0 && j < pointsA.size(); ++j) {
            const Point &cur_a_point = pointsA[j];
            assert(max_y_for_a >= cur_a_point.second);
            // Only B points whose x is within dist - best_diff_y of this A point
            // can still improve dist, since best_diff_y is the minimum y gap.
            const int diff_x = dist - best_diff_y;
            const int min_x = cur_a_point.first - diff_x + 1;
            const int max_x = cur_a_point.first + diff_x - 1;

            // Use INT_MAX as the tie-breaker so the reverse scan below also
            // visits B points whose x equals max_x (the window is inclusive).
            const Point search_term = std::make_pair(max_x, std::numeric_limits<int>::max());
            PointsList::const_iterator may_be_near_it = std::lower_bound(pointsB.begin(), pointsB.end(), search_term, cmp_by_x);

            // Walk backwards over B points until x falls below min_x.
            for (PointsList::const_reverse_iterator rit(may_be_near_it); rit != pointsB.rend() && rit->first >= min_x; ++rit) {
                const unsigned cur_dist = manhattan_dist(cur_a_point, *rit);
                if (cur_dist < dist) {
                    dist = cur_dist;
                }
            }
        }
        printf("%u\n", dist);
    }
}

Benchmark on my machine (Linux + i7 2.70 GHz + gcc -Ofast -march=native):

$ make bench
time ./test1 < data.txt  > test1_res

real    0m7.846s
user    0m7.820s
sys     0m0.000s
time ./test2 < data.txt  > test2_res

real    0m0.605s
user    0m0.590s
sys     0m0.010s

test1 is your variant, and test2 is mine.

fghj

You'll need to learn how to write functions and how to use containers. With your current coding style, it's infeasible to get a better solution.

The problem is that the better solution is a recursive method. Sort the points by X coordinate. Now recursively split the set in half and determine the closest distance within each half as well as the closest distance between a pair of points from either half.

The last part is efficient because both halves are sorted by X. Comparing the last values from the left half with the first value of the right half gives a good upper bound on the distance.
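
For illustration only, here is a minimal sketch of that recursion in C++, adapted to the two-set case by tagging each point with the set it came from and comparing only cross-set pairs. The struct Pt, the function closest_cross, and that tagging adaptation are assumptions made for the sketch, not something this answer prescribes.

#include <algorithm>
#include <cstdlib>
#include <vector>

struct Pt { int x, y; bool fromA; };

// Manhattan distance, widened to long long to avoid overflow.
static long long manhattan(const Pt &a, const Pt &b) {
    return std::llabs((long long)a.x - b.x) + std::llabs((long long)a.y - b.y);
}

// pts must be sorted by x on entry; returns the smallest A-to-B distance in
// pts[lo, hi), or a large sentinel if that range holds points of only one set.
static long long closest_cross(const std::vector<Pt> &pts, size_t lo, size_t hi) {
    const long long INF = 1LL << 62;
    if (hi - lo <= 3) {                        // brute-force tiny ranges
        long long best = INF;
        for (size_t i = lo; i < hi; ++i)
            for (size_t j = i + 1; j < hi; ++j)
                if (pts[i].fromA != pts[j].fromA)
                    best = std::min(best, manhattan(pts[i], pts[j]));
        return best;
    }
    const size_t mid = lo + (hi - lo) / 2;
    const int midx = pts[mid].x;
    long long d = std::min(closest_cross(pts, lo, mid),
                           closest_cross(pts, mid, hi));

    // Candidates within d of the dividing line, sorted by y.
    std::vector<Pt> strip;
    for (size_t i = lo; i < hi; ++i)
        if (std::llabs((long long)pts[i].x - midx) < d)
            strip.push_back(pts[i]);
    std::sort(strip.begin(), strip.end(),
              [](const Pt &a, const Pt &b) { return a.y < b.y; });

    // |dy| alone is a lower bound on the Manhattan distance, so the scan can
    // stop as soon as the y gap reaches d.
    for (size_t i = 0; i < strip.size(); ++i)
        for (size_t j = i + 1;
             j < strip.size() && (long long)strip[j].y - strip[i].y < d; ++j)
            if (strip[i].fromA != strip[j].fromA)
                d = std::min(d, manhattan(strip[i], strip[j]));
    return d;
}

It would be called after sorting the combined array by x, e.g. sort pts by x and then take closest_cross(pts, 0, pts.size()). Note the usual O(n log n) analysis assumes the two sets are interleaved; if one half of a split contains points from only one set, d stays at the sentinel for that half and the strip scan can degrade, so treat this strictly as a sketch of the idea, not a drop-in solution.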

MSalters
  • That algorithm has been shown to yield correct results for Euclidean distances, but are you certain it is correct for Manhattan distances? – Jason C Aug 28 '16 at 15:23
  • @JasonC: Of course. The algorithm merely dictates which point pairs you need to consider. It doesn't dictate the distance measure. The worst case is actually the same for both metrics: both points of a pair having the same Y coordinate, so the distance d is just `abs(x1-x2)`. – MSalters Aug 28 '16 at 16:09

So there's a really simple optimization you can make that can cut a ton of time off.

Since you state that all points in set A have y > 0, and all points in set B have y < 0, you can immediately discard all points in A whose y > the minimum distance found so far, and all points in B whose y < -mindist. These points can never be closer than the current closest pair:

for(int i = 0; i < n; i++) {
    if (pointsA[i].nd > dist)
        continue; // <- this is the big one
    for(int j = 0; j < n; j++) {
        if (pointsB[j].nd < -dist)
            continue; // <- helps although not as much
        if(abs(pointsA[i].st - pointsB[j].st) + abs(pointsA[i].nd - pointsB[j].nd) < dist) {
            dist = abs(pointsA[i].st - pointsB[j].st) + abs(pointsA[i].nd - pointsB[j].nd);
        }
    }
}
printf("%llu\n", dist);

For a test of 40000 points per set, on my machine with gcc and -O2 this reduces the time from 8.2 seconds down to roughly 0.01 seconds (and yields correct results)! (measured with QueryPerformanceCounter on Windows).

Not too shabby.

Fwiw, computing your distance twice isn't actually that big of a deal. First of all, that "second" calculation doesn't actually happen all that often; it only happens when a closer distance is found.

And secondly, for reasons I can't explain, storing it in a variable and only calculating it once actually consistently seems to add about 20% to the total run time, raising it from an average of 8.2 sec to about 10.5 seconds for the above set.

I'd say discarding points based on your assumptions about the Y values is by far the biggest bang for your buck you can get without significantly changing your algorithm.

You may be able to take advantage of that further by pre-sorting A and B, in order of increasing Y values for A and decreasing Y values for B, before finding the distances, to maximize the chance of skipping over subsequent points.
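
As a rough sketch of what that could look like (the wrapper function min_manhattan and the switch from continue to break are illustrative assumptions relying on the stated y > 0 / y < 0 guarantees, not code from this answer):

#include <algorithm>
#include <cstdlib>
#include <utility>

// Sketch only: sort A by increasing y and B by decreasing y, so the skip
// tests can become breaks; once one point is too far from the x-axis,
// every later point in that array is too.
unsigned long long min_manhattan(std::pair<int, int> *pointsA,
                                 std::pair<int, int> *pointsB, int n) {
    std::sort(pointsA, pointsA + n,
              [](const std::pair<int, int> &a, const std::pair<int, int> &b) {
                  return a.second < b.second;   // A: increasing y (all y > 0)
              });
    std::sort(pointsB, pointsB + n,
              [](const std::pair<int, int> &a, const std::pair<int, int> &b) {
                  return a.second > b.second;   // B: decreasing y (all y < 0)
              });

    unsigned long long dist = 4000000000ULL;    // same initial bound as the question
    for (int i = 0; i < n; i++) {
        if ((unsigned long long)pointsA[i].second > dist)
            break;                              // all remaining A points are even farther away
        for (int j = 0; j < n; j++) {
            if ((unsigned long long)(-(long long)pointsB[j].second) > dist)
                break;                          // all remaining B points are even farther away
            const unsigned long long cur =
                std::llabs((long long)pointsA[i].first  - pointsB[j].first) +
                std::llabs((long long)pointsA[i].second - pointsB[j].second);
            if (cur < dist)
                dist = cur;
        }
    }
    return dist;
}

The break conditions mirror the continue tests above; they are only safe because, after the sorts, the y values can only move farther from the axis as the loops advance.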

Jason C

Keep a list of candidates in group A and group B, initially containing the whole input. Take the min y of A and the max y of B for the closest pair in y, calculate their Manhattan distance, and eliminate from the candidate lists any point whose distance to the x-axis is greater than that upper bound. This might slash the input or it might have essentially no effect, but it's O(N) and a cheap first step.
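
A minimal sketch of that first step, assuming the question's guarantee that A has y > 0 and B has y < 0 (the helper name prune_by_y and its exact interface are illustrative, not part of this answer):

#include <algorithm>
#include <cstdlib>
#include <utility>
#include <vector>

typedef std::pair<int, int> Point;   // (x, y); every A point has y > 0, every B point y < 0

// Compute the initial upper bound from the A point and the B point that are
// closest to the x-axis, then drop every candidate whose own |y| already
// exceeds that bound: it cannot be part of a closer cross pair.
long long prune_by_y(std::vector<Point> &A, std::vector<Point> &B) {
    const Point a_low = *std::min_element(A.begin(), A.end(),
        [](const Point &p, const Point &q) { return p.second < q.second; });
    const Point b_high = *std::max_element(B.begin(), B.end(),
        [](const Point &p, const Point &q) { return p.second < q.second; });

    const long long bound =
        std::llabs((long long)a_low.first - b_high.first) +
        std::llabs((long long)a_low.second - b_high.second);

    A.erase(std::remove_if(A.begin(), A.end(),
        [&](const Point &p) { return (long long)p.second > bound; }), A.end());
    B.erase(std::remove_if(B.begin(), B.end(),
        [&](const Point &p) { return -(long long)p.second > bound; }), B.end());
    return bound;
}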

Now sort the remaining candidates in x and y. This gives you a separated list in y and a mixed list in x, and is O(N log N), where N has been cut down, hopefully but not necessarily, by step one. For each point, now calculate its closest neighbour in y (trivial) and its closest neighbour in x (a bit harder), then calculate its minimum possible Manhattan distance, assuming the closest point in x is also the closest in y. Eliminate any points further than your bound from the candidate list. Now sort again, by minimum possible distance. That's another O(N log N) operation.

Now start with your best candidate and find its genuine minimum distance by trying the closest point in either direction in x or y, terminating when either delta x or delta y goes above the best so far, or above your maximum bound. If you find a better candidate pair than the current candidate pair, purge the candidate list of everything with a worse minimum possible distance. If the best candidate point doesn't form half of a candidate pair, you just purge that one point.

When you've purged a certain number of candidates, recalculate the lists. I'm not sure what the best value to use would be; certainly, if you get to the worst candidate, you must do that and then start again at the best. Maybe use 50%.

Eventually you are left with only one candidate pair. I'm not quite sure what the analysis is: in the worst case I suppose you only eliminate a few candidates on each test. But for most inputs you should get the candidate list down to a small value pretty fast.

Malcolm McLean