Faulty conditional check in nearest neighbor calculation?

Question

I'm trying to write a function to calculate an approximate traveling salesman route through a list of cities using the nearest neighbor algorithm, starting from the first city listed. However, each time I run it I get IndexError: list index out of range.

In debugging the error, I found that the value of index stays the same from one loop iteration to the next, rather than changing. When it's time to append, the code checks the if not in condition; since it's False, it adds 1 to i and moves to the next iteration of the loop. Once it reaches a higher number than exists in the array, it gives me the error.

So my problem is, why does execution not enter the first if not in block? It seems like the code ignores it.

For my actual problem I'm reading in a file of 317 cities, each of which has an index and two coordinates. Here's a shorter sample list of cities for testing:

Nodelist = [
    (1, 63, 71),
    (2, 94, 71),
    (3, 142, 370),
    (4, 173, 1276),
    (5, 205, 1213),
    (6, 213, 69),
    (7, 244, 69),
    (8, 276, 630),
    (9, 283, 732),
    (10, 362, 69),
]

Here is the function's code:

def Nearest(Nodelist,Distance,index):
    Time_Calculation = time.time()
    Newarray=[]
    Newarray.append(Nodelist[0])
    for i in range(0,len(Nodelist)):
        for j in range(1,len(Nodelist)):
            if (Nodelist[j] not in Newarray):
                DisEquation = math.sqrt(pow(Nodelist[j][1]-Newarray[i][1],2)+pow(Nodelist[j][2]-Newarray[i][2],2))
                if Distance==0:
                    Distance=DisEquation
                if Distance > DisEquation:
                    index=j
                    Distance=DisEquation
        if(Nodelist[index] not in Newarray):
            Newarray.append(Nodelist[index])
        Distance=0
    print (time.time() - Time_Calculation)
    return Newarray

Code that calls it:

NearestMethodArr=Nearest(Cities,b,index)
print(NearestMethodArr)
print(len(NearestMethodArr))

The print statements should produce:

[(1, 63, 71), (2, 94, 71), (6, 213, 69), (7, 244, 69), (10, 362, 69), (3, 142, 370), (8, 276, 630), (9, 283, 732), (5, 205, 1213), (4, 173, 1276)]
10

Have you read the help page on making a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)? Context is good, but you're including detail that doesn't seem relevant (mainly file-reading code). You're also *not* including everything needed to test this—namely, a (small) sample city list. (Also, clearer variable names would help everyone out, probably yourself included.) — CrazyChucky, Apr 16 '21 at 03:21
This is by far the best way I described my problem but I will work more on this I will edit and add a small list of the cities I have and where my error comes at, thanks for the feedback — Al-Meqdad Jabi, Apr 16 '21 at 03:23

David Oldford · Accepted Answer · 2021-04-16T19:04:17.943

Newarray.append(Nodelist[0])#adding Nodelist[0] to NewArray    
for i in range(0,len(Nodelist)):
        for j in range(1,len(Nodelist)):
            if (Nodelist[j] not in Newarray):
                DisEquation = math.sqrt(pow(Nodelist[j][1]-Newarray[i [1],2)+pow(Nodelist[j][2]-Newarray[i][2],2)) #you access Newarray at i
                if Distance==0:
                    Distance=DisEquation
                if Distance > DisEquation:
                    index=j
                    Distance=DisEquation
        if(Nodelist[index] not in Newarray):
            Newarray.append(Nodelist[index])#you conditionally add new elements to newarray

If you see the comments I added to your code the issues should be clear. You loop over all the elements of Nodelist and call the index i you've already added one element to NewArray so the first time index 0 exists. Then you hit the Nodelist[index] not in Newarray, if it is true NewArray gets bigger by 1 and then NewArray[1] works if this is ever not true for any reason then NewArray stays the same size and the next NewArray[i] will be an index out of range error.

Edit: thanks to CrazyChucky for setting me straight in the comments. I've adjusted below

My comment about the failure was correct though I hadn't identified the root cause which was not setting index as the author identified. I didn't parse the code correctly in my head. The code in the updated version works though it would be faster and easier to read if you did something like:

def new_nearest(Nodelist):
    start_time = time.time()
    shortest_path = [Nodelist[0]]
    Nodelist = Nodelist[1:]
    while len(Nodelist) > 0:
        shortest_dist_sqr = -1
        next_node = None
        for potential_dest in Nodelist:
            dist_sqr = (shortest_path[-1][1] - potential_dest[1])**2 + (shortest_path[-1][2] - potential_dest[2])**2 #you don't keep the distances so there is no need to sqrt as if a > b then a**2 > b**2
            if shortest_dist_sqr < 0 or dist_sqr < shortest_dist_sqr:
                next_node = potential_dest
                shortest_dist_sqr = dist_sqr
        shortest_path.append(next_node)
        Nodelist.remove(next_node)
    print(time.time() - start_time)
    return shortest_path

This returns identical results but executes faster. changing to an approach where you remove nodes from the inner loop makes it clearer what's going on, it might make the code a bit slower (it would in C but python has a lot of overhead in various places that may make this a net gain,) and because there is no need to calculate the actual distance since you don't store it you can compare the distance squared and not do any square roots. If you do want the distance you can just square root the nearest node after you've determined which it is.

Edit: I couldn't help but check. The move to removing nodes from Nodelist actually represents the majority of the time savings while the lack of a sqrt does reliably speed things up (I used timeit and varied the code.) In a lower level language doing tiny things is ultra fast so it's likely faster to leave the array alone and just skip over elements that are already used (this might not actually be true because it will mess with branch prediction performance analysis is really hard and will depend on what processor architecture you're using...) In python though even small things are very expensive (add two variables: figure out what types they are, decode variable byte length ints, do add, create new object for result...) so even though removing a value from a list may be more expensive than skipping values and leaving the list alone it will lead to more small operations that are very slow in Python. If a low level language was in use you could also recognise that the order of the nodes is arbitrary (except for which one is first,) so you can just have an array with all the nodes and instead of making a new small array you track the length of the values used in the array and copy the last value in the array over the value that is selected for the next node in the path.

Edit yet again :P : Again I couldn't help but be curious. The approach to overwriting part of the nodelist instead of removing entries had me wondering as to if it would be faster in python as it does create more work that is slow in python but decreases the amount of work involved in removing node elements. Turns out that this approach is a noticeable (though not very significant a little less than 2% but consistent in micro benchmarks) speedup even in python so the below code is even faster:

def newer_nearest(Nodelist):
    shortest_path = [Nodelist[0]]
    Nodelist = Nodelist[1:]
    while len(Nodelist) > 0:
        shortest_dist_sqr = -1
        next_node = None
        for index, potential_dest in enumerate(Nodelist):
            dist_sqr = (shortest_path[-1][1] - potential_dest[1])**2 + (shortest_path[-1][2] - potential_dest[2])**2 #you don't keep the distances so there is no need to sqrt as if a > b then a**2 > b**2
            if shortest_dist_sqr < 0 or dist_sqr < shortest_dist_sqr:
                next_node = index
                shortest_dist_sqr = dist_sqr
        shortest_path.append(Nodelist[next_node])
        Nodelist[next_node] = Nodelist[-1]
        Nodelist = Nodelist[:-1]
    return shortest_path

Hey! thanks for the feedback on my code, I solved my problem but I want to tell you why I added conditions on my append and why it actually worked, first of all I append Nodelist[0] as the first element in my array, so that I can compare the distance between this city and the others and choose the shortest, then about my condition and why I was sure the i int was the best way to do it, I know that every time I run my loop it will append an element to the new array that I have and finally it will reach 318 elements which is the same as the old array (Nodelist) — Al-Meqdad Jabi, Apr 16 '21 at 04:14
the condition is there because it runs 3 times after it reaches the max and adds the last index element to the new array at the end — Al-Meqdad Jabi, Apr 16 '21 at 04:15
If you know for sure that the code path will always be taken to add the additional value to Newarray then there is no reason for the if, you can just add the value every time without checking if the value at NodeList[index] is already in Newarray. Also if you use the looping structure I show and there are no duplicate nodes then distance will never be 0 so you can remove everything to do with that. — David Oldford, Apr 16 '21 at 12:34
@DavidOldford That looping method is indeed much more idiomatic, but it in this case the algorithm will often need to compare a node to nodes *before* it, since the traveling salesman certainly isn't guaranteed to traverse the node list in order. — CrazyChucky, Apr 16 '21 at 15:37
but the code you've shown is calculating the straight line distance between each node not the shortest path through a network of nodes. If you are using straight lines the distance from A to B will always be the same as the distance from B to A. This is to say your code does not attempt to solve the travelling salesmen problem. — David Oldford, Apr 16 '21 at 15:42
@DavidOldford It's kind of buried in their question and comments, and I didn't see it at first either, but they're trying to approximate the optimal traveling salesman path using the [nearest neighbor algorithm](https://en.wikipedia.org/wiki/Nearest_neighbour_algorithm). They want to start at the first node, then move to the closest node from that one, then repeat the process until all nodes have been visited, then return to the start. In each iteration, the current node needs to be measured against all remaining (unvisited) nodes, not just nodes *after* it in the original list. — CrazyChucky, Apr 16 '21 at 15:51
Oh you're right sorry, I also got confused and thought you were the author, I've been working 11+ hour days for a while and am still kind of dazed. So they skip the first node because they know they've already visited it, it looked to me like they were just generating a lookup table of distances between nodes. So NewArray is actually the path. It would be a lot easier to read and more efficient if they did the check for NewArray at the beginning of the loop instead of the end or just maintained a list or set of unvisited nodes that they removed values from over time and looped through that. — David Oldford, Apr 16 '21 at 16:08
The really confusing part is the double check for the node not being in the path. Index can never refer to an object that is in NewArray because it comes from j and values of j that are already in NewArray are skipped at the beginning of the loop. Having a list you remove values from is a hit or miss sort of thing as you're replacing loop passes with memory changes. Since this is in python it probably is quicker as you'd rely on more high level operations but also speed probably isn't a concern. The main readability issue that remains is variable naming and the impossible if in my opinion. — David Oldford, Apr 16 '21 at 16:26

CrazyChucky · Answer 2 · 2021-04-25T18:59:49.943

David Oldford's answer beat me to the punch for most functional improvements, but I wanted to touch on several specific things that can make your code cleaner and more Pythonic. Readability counts.

The main improvements below:

There is rarely, if ever, a need to use for i in range(len(sequence)) in Python. Doing so goes against the spirit of how Python for loops are designed. If you need an index, use for i, element in enumerate(sequence). (When you don't need the index, simply use for element in sequence.)
Give your variables names that are clear (they say what they are) and consistently formatted (ideally snake_case, but if you prefer camelCase or some other format, choose one and stick with it). This will not only help others reading your code; it will help you when you come back a month later and don't remember why you wrote what you did.
Break up long lines; put spaces around assignment (=) and comparison (e.g. ==), and after commas; use blank lines (sparingly) to separate chunks of code for better readability.
In Python you can create infinity with float('inf') (or negative infinity with -float('inf'). This is the idiomatic way to make comparisons to find the smallest or largest of something.
Your function operates on the list of nodes/cities. It doesn't make sense to supply a distance or index as an argument to the function.
Multiple assignment is often a good way to make code much clearer by assigning descriptive names to indices. It's much easier to read (x2 - x1)**2 + (y2 - y1)**2 and understand what it's doing, compared to pow(Nodelist[j][1]-Newarray[i][1],2)+pow(Nodelist[j][2]-Newarray[i][2],2).
I wouldn't include quite so many comments in my own code, were it not intended for a post like this. But some comments explaining control flow can be very helpful.

import time

def nearest_neighbor_path(cities):
    """Find traveling salesman path via nearest neighbor algorithm."""
    start_time = time.time()

    # Start at the first city, and maintain a list of unvisited cities.
    path = [cities[0]]
    remaining_cities = cities[1:]

    # Loop until every city has been visited. (An empty list evaluates
    # to False.)
    while remaining_cities:
        # In each loop, set starting coordinates to those of the current
        # city, and initialize shortest distance to infinity.
        _, x1, y1 = path[-1]
        shortest_so_far = float('inf')
        nearest_city = None

        # Investigate each possible city to visit next.
        for index, other_city in enumerate(remaining_cities):
            # Since we don't need the *actual* distance, only a
            # comparison, there's no need to take the square root.
            # (Credit to David Oldford, good catch!)
            _, x2, y2 = other_city
            distance_squared = (x2 - x1)**2 + (y2 - y1)**2

            # If it's the closest one we've seen so far, record it.
            if distance_squared < shortest_so_far:
                shortest_so_far = distance_squared
                index_of_nearest, nearest_city = index, other_city

        # After checking all possible destinations, add the nearest one
        # to the path...
        path.append(nearest_city)
        # ...and remove it from the list of remaining cities. This could
        # simply be remaining_cities.remove(nearest_city), in which case
        # we wouldn't need the index or enumerate() at all. But doing it
        # this way saves an extra iteration to find the city again in
        # the list.
        remaining_cities.pop(index_of_nearest)
    
    print(f'Elapsed time: {time.time() - start_time :f} seconds')
    return path

If you don't mind importing NumPy for its argmin function, the while loop could be further streamlined like so:

import numpy as np
import time

def nearest_neighbor_path(cities):
    """Find traveling salesman path via nearest neighbor algorithm."""
    start_time = time.time()
    path = [cities[0]]
    remaining_cities = cities[1:]

    while remaining_cities:
        _, x1, y1 = path[-1]
        distances = [(x2 - x1)**2 + (y2 - y1)**2
                     for _, x2, y2 in remaining_cities]
        index_of_nearest = np.argmin(distances)
        
        nearest_city = remaining_cities.pop(index_of_nearest)
        path.append(nearest_city)
    
    print(f'Elapsed time: {time.time() - start_time :f} seconds')
    return path

I'd also recommend taking a look at PEP 8, the official Python style guide. It's a very good set of guidelines for keeping Python code clear and readable. The easier your code is to read, the easier it is for you to identify problems and solutions.

Conceptually you'd expect that removing items from a set should be faster than from a list but modern processors are really good at doing lots of really simple things and not great at doing more complicated things. Sets are more complicated but cut down on the number of things while lists have lots of copying and no complicated things. Even for reasonably large numbers of values lists are reliably faster. — David Oldford, Apr 16 '21 at 19:18
`import timeit def rem_list(a): a.remove(0) def rem_set(a): a.remove(0) def run_bench(n): a = [i for i in range(n)] print(f"list:{timeit.timeit(lambda:remList(a.copy()),number=1000)} set:{timeit.timeit(lambda:remSet(set(a.copy())),number=1000)}") for n in [10,100,1000,1000000]: run_bench(n)` lists come out ahead even with a million elements, in fact unexpectedly the larger the set/list the more lists seem to come out ahead. lists could be optimized for removing values from the beginning but I get the same results if I pull from the middle. — David Oldford, Apr 16 '21 at 19:18
@DavidOldford Fascinating... I tested it just now with up to 10,000 cities, and I get the same time results for both. I really would not have expected that. Everything I've read in the past has always cautioned against adding or removing elements from the middle of lists. I've edited my answer, thank you! — CrazyChucky, Apr 16 '21 at 19:34
Oh, it hadn't occurred to me that we're also *iterating* over the set or list, and iterating over sets is often slower. That could definitely be part of it as well. — CrazyChucky, Apr 16 '21 at 19:40
yeah I didn't think to mention the iteration piece. In C++ the advice is generally to use a vector unless you really aren't sure you should in which case you should still use a vector and only deviate based on measured performance. The set even ignoring the iteration involves calculating a hash and potentially allocating memory and adjusting pointers or whatever. Processors are really good at copying large chunks of memory which is a worst case for removing from a list. — David Oldford, Apr 16 '21 at 20:04
I need to walk back my previous results a bit. I made a mistake and did the conversion to a set inside the timeit lambda. When I fix things sets are reliably faster for removal in the middle of the data when you want to remove by value vice index which is another thing I forgot to think about. Lists are faster removing early values and faster when removing by index if you already know where a value is even if it's in the middle of the data set. So I was correct that removing elements is faster I forgot about the lookup time to find them by value and the list is better for iteration. — David Oldford, Apr 16 '21 at 20:58
`def remove_element(a, n):` ` a.remove(n)` `def dellist(a,n):` ` del a[n]` `a = [i for i in range(n)]` `b = set(a)` `print(f"dellist:{timeit(lambda : dellist(a.copy(),n >> 1),number=1000)} list:{timeit(lambda : remove_element(a.copy(), n >> 1), number=1000)} set:{timeit(lambda: remove_element(b.copy(), n >> 1), number=1000)}")` — David Oldford, Apr 16 '21 at 21:00

score -1 · Answer 3 · answered Apr 16 '21 at 03:55

-1

I find out what was the problem with my code, when I reassigned the Distance to x I forgot that I need to reassign the index with it because the first city that got tested was had the shortest distance and In my first code I reassigned the variable index only when x is < than the Distance

new code:

def Nearest(Nodelist,Distance,index):
    Time_Calculation = time.time()
    Newarray=[]
    Newarray.append(Nodelist[0])
    for i in range(0,len(Nodelist)):
        for j in range(1,len(Nodelist)):
            if (Nodelist[j] not in Newarray):
                DisEquation = math.sqrt(pow(Nodelist[j][1]-Newarray[i][1],2)+pow(Nodelist[j][2]-Newarray[i][2],2))
                if Distance==0:
                    Distance=DisEquation
                    index=j
                if Distance > DisEquation:
                    index=j
                    Distance=DisEquation
        if(Nodelist[index] not in Newarray):
            Newarray.append(Nodelist[index])
        Distance=0
    print (time.time() - Time_Calculation)
    return Newarray
    ```

answered Apr 16 '21 at 03:55

Al-Meqdad Jabi

65
1
6

Why are `Distance` and `index` arguments, and what do you supply for them? – CrazyChucky Apr 16 '21 at 04:24
Also if you only add nodes to the final list if they're not already there, are you really finding nearest neighbors? Two different nodes could each have the same nearest neighbor. (Picture three nodes in a line. The nearest neighbor of the left node is the middle node. The nearest neighbor of the right node is *also* the middle node.) – CrazyChucky Apr 16 '21 at 04:38
when I thought of creating this function, I had so many problem with the variables especially since it's python, so I send normally I like to send variables as arguments to avoid any issues or errors with python. – Al-Meqdad Jabi Apr 16 '21 at 04:40
1

well that's true but i'm not aiming into creating a complete graph, here i'm just trying to make a route from city 0 and pass by every single city, so I choose the closest to each city and find the shortest path, I forgot to explain more this is a shortest path algorithm – Al-Meqdad Jabi Apr 16 '21 at 04:42
so here I will calculate distances and find the shortest path that allows me to visit every single city – Al-Meqdad Jabi Apr 16 '21 at 04:43

Faulty conditional check in nearest neighbor calculation?

3 Answers3