15

First let me phrase the proper question:

Q: There is a file containing more than a million points (x, y), each of which represents a star. Planet Earth is at (a, b). The task is to build an algorithm that returns the 100 stars closest to Earth. What would be the time and space complexity of your algorithm?

This question has been asked many times in various interviews. I tried looking up the answers but could not find a satisfactory one.

One way I thought of is to use a max-heap of size 100: calculate the distance for each star and check whether it is less than the root of the max-heap. If it is, replace the root with that star and call heapify.

Any other better/faster answers?

P.S.: This is not a homework question.

noMAD
  • possible duplicate of [Find the x smallest integers in a list of length n](http://stackoverflow.com/questions/3764355/find-the-x-smallest-integers-in-a-list-of-length-n) – hugomg Feb 08 '12 at 22:16
  • Yes, pity. It's an interesting question, but already answered here. – Daniel Fischer Feb 08 '12 at 22:21
  • @missingno: It is kind of similar, but that question can easily be solved by the solution I provided above. Here some extra calculation is required, and I wanted to know if there is a way to minimize it. – noMAD Feb 08 '12 at 22:31

5 Answers

29

You can actually do this in time O(n) and space O(k), where k is the number of closest points that you want, by using a very clever trick.

The selection problem is as follows: Given an array of elements and some index i, rearrange the elements of the array such that the ith element is in the right place, all elements smaller than the ith element are to the left, and all elements greater than the ith element are to the right. For example, given the array

40  10  00  30  20

If I tried to select based on index 2 (zero-indexed), one result might be

10  00  20  40  30

Since the element at index 2 (20) is in the right place, the elements to the left are smaller than 20, and the elements to the right are greater than 20.

It turns out that since this is a less strict requirement than actually sorting the array, it's possible to do this in time O(n), where n is the number of elements of the array. Doing so requires some complex algorithms like the median-of-medians algorithm, but is indeed O(n) time.
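As a concrete illustration, here is a minimal Python sketch of partition-based selection (randomized quickselect), run on the example array above. The random pivot gives expected O(n) time; swapping in a median-of-medians pivot would make the worst case O(n) as well.

import random

def select(a, k):
    """Rearrange a in place so that a[k] is the k-th smallest element
    (zero-indexed), with smaller elements to its left and larger ones
    to its right."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:  # Hoare-style partition around the pivot value
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        if k <= j:
            hi = j   # the k-th smallest lies in the left part
        elif k >= i:
            lo = i   # the k-th smallest lies in the right part
        else:
            break    # j < k < i: a[k] is already in place

stars = [40, 10, 0, 30, 20]
select(stars, 2)
print(stars[2])  # 20, with 0 and 10 to its left, 30 and 40 to its right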

So how do you use this here? One option is to load all n elements from the file into an array, then use the selection algorithm to select the top k in O(n) time and O(n) space (here, k = 100).

But you can actually do better than this! For any constant k that you'd like, maintain a buffer of 2k elements. Load 2k elements from the file into the array, then use the selection algorithm to rearrange it so that the k smallest elements are in the left half of the array and the k largest are in the right, then discard the largest k elements (they can't be any of the k closest points). Now load k more elements from the file into the buffer, run the selection again, and repeat until you've processed every line of the file. Each selection discards the k largest elements in the buffer and retains the k closest points seen so far, so at the very end you can run the selection one last time on the remaining buffer to obtain the k closest points.

What's the complexity of the new approach? Well, you're using O(k) memory for the buffer and the selection algorithm. You end up calling select on a buffer of size O(k) a total of O(n / k) times, since you call select after reading k new elements. Since select on a buffer of size O(k) takes time O(k), the total runtime here is O(n + k). If k = O(n) (a reasonable assumption), this takes time O(n), space O(k).
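For reference, here is a minimal sketch of this buffered approach, assuming the file holds one comma-separated "x,y" pair per line (the path handling and function names are illustrative). For brevity, sorting stands in for the selection step; substituting a linear-time select, such as the quickselect sketched above, is what gives the O(n) time, O(k) space bound.

def select_k_smallest(buf, k):
    # Stand-in for an O(len(buf)) selection: keep only the k smallest.
    buf.sort()
    del buf[k:]

def k_closest(path, a, b, k=100):
    buf = []  # holds at most 2k entries of (squared_distance, x, y)
    with open(path) as f:
        for line in f:
            x, y = map(float, line.split(','))
            # Squared distance preserves the ordering, so sqrt is unneeded.
            buf.append(((x - a) ** 2 + (y - b) ** 2, x, y))
            if len(buf) == 2 * k:
                select_k_smallest(buf, k)  # discard the k farthest so far
    select_k_smallest(buf, k)              # final pass over at most 2k points
    return [(x, y) for _, x, y in buf]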

Hope this helps!

templatetypedef
    To this I would add one more optimization. Before adding a new element to the buffer, discard if it is larger than the k'th largest found in previous iterations. And in this "larger than" test, you can first check whether any single coordinate is larger before you test the actual distance. This doesn't change the big-O at all, but it avoids a lot of distance calculations, and the square root operation is fairly slow. So you get a better constant. – btilly Feb 08 '12 at 22:40
  • @btilly: you can always avoid the sqrt operation, since sqrt is a monotonic function. Points that minimize distance also minimize distance squared (the square cancels out the sqrt). – Rob Neuhaus Feb 08 '12 at 23:03
  • @rrenaud you're right. However multiplication is still more expensive than comparison, so avoiding squaring is still worthwhile. – btilly Feb 09 '12 at 16:08
  • Excellent algorithm and explanation. – Håvard Geithus Jul 28 '12 at 16:40
  • How did you decide to use '2-times-k' elements. Why not '3-times-k' or something else like that? – Darth.Vader Sep 10 '13 at 07:03
  • @user721998 – Any fixed multiple of k greater than one will work. The number of selections is n / ((m - 1)k), where m is the multiple, and since each takes O(k) time the total runtime is O(n / (m - 1)) and the space usage is O(mk). I just picked 2 because it's an easy number to work with. Good question! – templatetypedef Sep 10 '13 at 08:01
1

To elaborate on the max-heap solution: you would build a max-heap with the first k elements from the file (k = 100 in this case).

The key for the max-heap would be the star's distance from Earth at (a, b). The distance between two points on a 2D plane can be calculated as:

dist((x1, y1), (x2, y2)) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Building this heap takes O(k) time. For each of the remaining (n - k) elements, fetch its distance from Earth and compare it with the top of the max-heap. If the new element is closer to Earth than the top of the max-heap, replace the top and sift the new root down (heapify).

This pass takes O((n - k) log k) time. At the end, the max-heap contains exactly the k closest elements; extracting them one by one (extract-max, k times) takes another O(k log k).

The overall time complexity is O(k + (n - k) log k + k log k).
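As a rough sketch of this approach in Python, assuming the file holds one comma-separated "x,y" pair per line: heapq is a min-heap, so negating the key turns it into a max-heap, and the squared distance is used because it preserves the ordering (see the comments above) while skipping the slow sqrt.

import heapq

def k_closest_heap(path, a, b, k=100):
    heap = []  # entries are (-squared_distance, x, y)
    with open(path) as f:
        for line in f:
            x, y = map(float, line.split(','))
            d = (x - a) ** 2 + (y - b) ** 2
            if len(heap) < k:
                heapq.heappush(heap, (-d, x, y))
            elif d < -heap[0][0]:
                # Closer than the farthest of the current k: replace the root.
                heapq.heapreplace(heap, (-d, x, y))
    return [(x, y) for _, x, y in heap]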

Abhi Tk
0
from math import sqrt

earth = [0, 0]

# getDistance returns the Euclidean distance between two stars.
def getDistance(star1, star2):
    return sqrt((star1[0] - star2[0]) ** 2 + (star1[1] - star2[1]) ** 2)

# dict_galaxy maps the sequence number assigned to each star to a list
# [distance, [x, y]].
dict_galaxy = {}
with open('./file_copd.out') as iFile:
    for count, line in enumerate(iFile):
        star = [float(v) for v in line.split(',')]  # star is a list [x, y]
        dict_galaxy[count] = [getDistance(earth, star), star]

# Sort the star keys by distance and keep the 100 closest.
list_sorted_key = sorted(dict_galaxy, key=lambda x: dict_galaxy[x][0])
print('is this what you want %s' % list_sorted_key[:100])
aertoria
0

This is a famous problem, and there are lots of solutions for it: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

If you don't find that useful, there are other resources, such as O'Rourke's computational geometry book.

orezvani
0

Your algorithm is correct. Just remember that the time complexity of your program is O(n · log 100) = O(n), unless the number of closest points to find can vary.

Sid Datta