
This question covers a software algorithm, so it should be on topic.

I am working on an interview question from Amazon Software Question, specifically
"Given a set of points (x,y) and an integer "n", return n number of points which are close to the origin"

Here is the sample high-level pseudocode answer to this question, from Sample Answer:
Step 1: Design a class called point which has three fields - int x, int y, int distance
Step 2: For all the points given, find the distance between them and origin
Step 3: Store the values in a binary tree
Step 4: Heap sort
Step 5: print the first n values from the binary tree
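Reading steps 3–5 charitably (as the comments below suggest), the "binary tree" is the implicit binary heap that heap sort works on. Here is a hedged sketch of that reading in Java, using `PriorityQueue` as the heap; the class and method names (`ClosestPoints`, `nClosest`) are mine, not from the sample answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the sample answer's steps 3-5, reading
// "binary tree" as the implicit binary heap that heap sort uses.
class ClosestPoints {
    static class Point {
        final int x, y;
        final long distSq; // squared distance: avoids sqrt, stays exact in a long

        Point(int x, int y) {
            this.x = x;
            this.y = y;
            this.distSq = (long) x * x + (long) y * y;
        }
    }

    // Steps 3+4: push every point into a min-heap keyed on distance.
    // Step 5: pop the n smallest.
    static List<Point> nClosest(List<Point> points, int n) {
        PriorityQueue<Point> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.distSq, b.distSq));
        heap.addAll(points);                      // O(n log n) total insertion
        List<Point> result = new ArrayList<>();
        for (int i = 0; i < n && !heap.isEmpty(); i++) {
            result.add(heap.poll());              // O(log n) per extraction
        }
        return result;
    }
}
```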

I agree with steps 1 and 2, because it makes sense in terms of object-oriented design to have one software bundle of data, Point, encapsulate the fields x, y, and distance (encapsulation).

Can someone explain the design decisions from 3 to 5?

Here's how I would do steps 3 to 5:
Step 3: Store all the points in an array
Step 4: Sort the array with respect to distance (I would use some built-in sort here, like Arrays.sort)
Step 5: With the array sorted in ascending order, I print off the first n values
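My three steps above can be sketched as follows (a minimal illustration; the class and method names are my own):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the array-based alternative: store, sort by distance, take first n.
class ClosestPointsArray {
    static int[][] nClosest(int[][] points, int n) {
        int[][] copy = Arrays.copyOf(points, points.length);
        // Arrays.sort on objects is a stable O(n log n) merge sort (Timsort).
        // Comparing squared distances avoids sqrt and floating point entirely.
        Arrays.sort(copy, Comparator.comparingLong(
            (int[] p) -> (long) p[0] * p[0] + (long) p[1] * p[1]));
        return Arrays.copyOf(copy, Math.min(n, copy.length));
    }
}
```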

Why did the author of that response use a more complicated data structure, a binary tree, rather than something simpler like the array I used? I know what a binary tree is: a hierarchical data structure of nodes, each with two child pointers. In his algorithm, would you have to use a BST?

committedandroider
  • Perhaps they mean the implicit binary tree that heap sort usually works with? – harold Feb 20 '15 at 21:46
  • As @harold said, it sounds like they mean you insert and extract from a binary heap, which has a Big-O of `log(n)`, which is better than the `n log (n)` you would get with the sorted array approach. – jpriebe Feb 20 '15 at 22:02
  • 1
    @jpriebe but you have to insert n items ... so it will still be n log n – Adrian Leonhard Feb 20 '15 at 22:02
  • Maybe because the time and space complexity of Arrays.sort is more than when you do a heap sort (worst case nlogn). http://stackoverflow.com/a/22571601/2344337 – Nemin Feb 20 '15 at 22:05
  • You're right Adrian, my mistake – jpriebe Feb 20 '15 at 22:06
  • @Nemin Actually in Java 7, sort(Object[]) uses Timsort, which is O(n log n). My answer, which you linked, refers to sort(int[]), where a variant of quicksort is used that is indeed Omega(n^2) – Niklas B. Feb 20 '15 at 22:10
  • @Nemin wouldn't it be the same cause the time complexity of heap sort overall is also O(n log n)? – committedandroider Feb 21 '15 at 02:38

3 Answers


First, I would not say that having Point(x, y, distance) is good design or encapsulation. distance is not really part of a point; it can be computed from x and y. In terms of design, I would instead have a function, e.g. a static method on Point or a helper class Points:

double distance(Point a, Point b)
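For illustration, a minimal helper class along those lines (the `Points` class name and the nested `Point` are my own scaffolding, not part of the answer):

```java
// Minimal sketch of the helper-class idea: distance is computed from
// coordinates rather than stored as a field on Point.
class Points {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Euclidean distance between two points; distance to the origin
    // is just distance(p, new Point(0, 0)).
    static double distance(Point a, Point b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return Math.sqrt(dx * dx + dy * dy);
    }
}
```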

Then, for the specific question, I actually agree with your solution: put the data in an array, sort this array, and then extract the first N. What the example may be hinting at is that heapsort often uses a binary tree structure laid out inside the array being sorted, as explained here:

The heap is often placed in an array with the layout of a complete binary tree.

Of course, if the distance to the origin is not stored in the Point, then for performance reasons it has to be kept alongside the corresponding Point object in the array, or at least some information that allows retrieving the Point from the sorted distance (a reference or an index), e.g.

List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();

to be sorted with a Comparator<Pair<Long, Point>>
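There is no `Pair` type in `java.util`; one concrete way to realize this sketch is with `AbstractMap.SimpleEntry` as the pair (the `DistancePairs` class and `sortByDistance` method are my own names):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;

// The Pair type in the answer is not part of java.util; this sketch uses
// AbstractMap.SimpleEntry to pair each squared distance with its Point.
class DistancePairs {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static List<Point> sortByDistance(List<Point> points) {
        List<SimpleEntry<Long, Point>> pairs = new ArrayList<>();
        for (Point p : points) {
            long d = (long) p.x * p.x + (long) p.y * p.y; // computed once per point
            pairs.add(new SimpleEntry<>(d, p));
        }
        // The comparator only looks at the cached distance key.
        pairs.sort((a, b) -> Long.compare(a.getKey(), b.getKey()));
        List<Point> sorted = new ArrayList<>();
        for (SimpleEntry<Long, Point> e : pairs) sorted.add(e.getValue());
        return sorted;
    }
}
```

Note the key stores the squared distance, following harold's comment below: for comparison purposes the square root is unnecessary.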

T.Gounelle
  • Wouldn't it be better to calculate the distance once and store it as a field rather than re calculating it from x and y each time you need it? – committedandroider Feb 21 '15 at 02:41
  • Even better, you can use the square of the distance to compare their distance from the origin. Saves a square root. – harold Feb 21 '15 at 09:10
  • @committedandroider, that is exactly the purpose of sorting a `Pair` of `distance` and `Point`. The distance is only computed once for each `Point`. And this way, you can actually do what @harold suggest : use the squared distance instead of just distance. It saves some computing. – T.Gounelle Feb 21 '15 at 10:11
  • @AbbéRésina What does that pair do? – committedandroider Feb 23 '15 at 04:45
  • That is just a convenient way to link a `distance` to a `Point` for the time of sorting : the `Comparator` will only use the distance field. You can also have the computed distance _cached_ as a `transient` field of `Point` and computed/accessed through a `distanceFromOrigin()` method. – T.Gounelle Feb 23 '15 at 10:20

It is not necessary to use a BST. However, it is good practice to use a BST when you need a self-sorting structure. I do not see the need to both use a BST and heap-sort it (somehow). You could use just a BST and retrieve the first n points. You could also use an array, sort it, and take the first n points. If you want to sort an array of type Point, you can have Point implement the Comparable interface and override its compareTo method. You never have to choose any particular data structure; by determining your needs, you can easily determine the optimum structure.
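A minimal sketch of the Comparable idea mentioned above (the `ComparablePoint` name and `distSq` helper are my own; ordering is by squared distance from the origin):

```java
// Sketch: Point implements Comparable so a plain Arrays.sort or
// Collections.sort orders points by their distance from the origin.
class ComparablePoint implements Comparable<ComparablePoint> {
    final int x, y;

    ComparablePoint(int x, int y) { this.x = x; this.y = y; }

    // Squared distance is enough for ordering and avoids sqrt.
    long distSq() { return (long) x * x + (long) y * y; }

    @Override
    public int compareTo(ComparablePoint other) {
        return Long.compare(this.distSq(), other.distSq());
    }
}
```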

reaponja
  • For performance wise, would you go binary search tree or binary tree and heapsort? – committedandroider Feb 21 '15 at 02:31
  • I almost always implement binary trees as BSTs, depending on the criteria of search. BST is a dictionary and as such it is good for the operations of insertion, deletion and search of data. In this case, the dilemma would rather be (in my opinion) between a dictionary and a sequential data structure, like array or arraylist. In this instance, I'd use an arraylist, since you do not know the exact number of points you'll need stored. – reaponja Feb 21 '15 at 07:37

The approach described in this post is more complex than needed for such a question. As you noted, simply sorting by distance will suffice. However, to help resolve your confusion about what the sample answer's author was trying to get at, consider the k-nearest-neighbors problem, which can be solved with a k-d tree, a structure that applies space partitioning to a k-dimensional dataset. For 2-dimensional space, that is indeed a binary tree. This tree is inherently sorted and doesn't need any "heap sorting."


It should be noted that building the k-d tree will take O(n log n), and is only worth the cost if you need to do repeated nearest neighbor searches on the structure. If you only need to perform one search to find k nearest neighbors from the origin, it can be done with a naive O(n) search.
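For the single-query case, one common one-pass alternative keeps a bounded max-heap of size k while scanning the points. Strictly this is O(n log k), not the O(n) of quickselect, but it is simple and never stores more than k candidates. A hedged sketch (the `BoundedHeapSearch` name is mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// One-pass search for the k points nearest the origin using a max-heap
// bounded at size k. O(n log k) time, O(k) extra space; a true O(n)
// bound would need quickselect instead.
class BoundedHeapSearch {
    static List<int[]> kNearestToOrigin(List<int[]> points, int k) {
        // Max-heap: the farthest of the current candidates sits at the head.
        PriorityQueue<int[]> worstFirst = new PriorityQueue<>(
            (a, b) -> Long.compare(distSq(b), distSq(a)));
        for (int[] p : points) {
            worstFirst.offer(p);
            if (worstFirst.size() > k) worstFirst.poll(); // evict the farthest
        }
        return new ArrayList<>(worstFirst); // k nearest, in no particular order
    }

    static long distSq(int[] p) {
        return (long) p[0] * p[0] + (long) p[1] * p[1];
    }
}
```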

How to build a k-d tree, straight from Wiki:

One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.

Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.

Once you have built the tree, you can find the k nearest neighbors to some point (the origin in your case) in O(k log n) time.

Straight from Wiki:

Searching for a nearest neighbour in a k-d tree proceeds as follows:

  1. Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
  2. Once the algorithm reaches a leaf node, it saves that node point as the "current best"
  3. The algorithm unwinds the recursion of the tree, performing the following steps at each node:
    1. If the current node is closer than the current best, then it becomes the current best.
    2. The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
      1. If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
      2. If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
  4. When the algorithm finishes this process for the root node, then the search is complete.

This is a pretty tricky algorithm that I would hate to need to describe as an interview question! Fortunately the general case here is more complex than is needed, as you pointed out in your post. But I believe this approach may be close to what your (wrong) sample answer was trying to describe.

The111
  • FWIW, since the query point is fixed, you don't need any of that. – Niklas B. Feb 20 '15 at 22:13
  • Right. As I noted if k is also fixed you don't even need to do anything except a naive linear search. But I think k-d trees are interesting, and more poignantly, I think they might be what the wrong answer in the OP was trying to get at. For fixed "nearest" source (i.e. origin) and general k though I agree your O(n) quickselect is probably unbeatable. – The111 Feb 20 '15 at 22:14