
I have written a very crude hill-climb/greedy algorithm for solving the travelling salesman problem. The aim was to use this as a benchmark and continually improve running time and optimise the solution at the same time.

EDIT: Following Stefan Pochmann's suggestion, this appears to be related to hardware rather than the program. On an i7-4790 processor, running sum(xrange(10**8)) immediately before starting the hill-climb loop causes the run time of the first hill-climb to more than halve, and it improves further on each subsequent loop.
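That warm-up can be sketched like this (a minimal sketch, not my actual script: `range` and a smaller `10**6` are used here so it also runs quickly on Python 3, and `time.time()` stands in for the real timing code):

```python
import time

# Throwaway computation first, so the CPU/OS is already ramped up
# by the time the timed work starts (per the edit above).
sum(range(10 ** 6))

start = time.time()
result = sum(range(10 ** 6))   # stand-in for the first hill-climb run
elapsed = time.time() - start
print('%.6f' % elapsed)
```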

If I run the hill-climb algorithm on the same data 10 times in a for loop (each hill-climb doing 10,000 iterations), there is almost always a general decline in runtime from one solution to the next, with the final solution taking ~50% of the time required for the first. The only thing calculated for each solution is the hill-climb itself; supporting data such as the distance/time matrix and the job list is built in memory before any solution runs. The first three functions below exist only to make this an MCVE and can be ignored.

The printed output is the runtime followed by a list of [(iteration_number, greedy_route_cost)], i.e. the iteration number at which a new best_route was selected while searching for a solution. The runtime seems independent of how many interim routes are stored from each run of the hill-climb, and each solution should be independent. Can anyone see a reason for this acceleration in calc_route_cost or hill_climb? What am I missing, and how would I go about profiling the cause? This is Python 2.7.9.
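One way the decline could be measured in isolation is with `timeit.repeat`, which re-runs the same callable so a warm-up trend shows up as later runs being faster (a minimal sketch with an illustrative stand-in workload, not the hill-climb itself):

```python
import timeit

def stand_in():
    # Illustrative stand-in for one hill-climb run.
    return sum(i * i for i in range(10 ** 5))

# Five timed batches of the same workload; if warm-up matters, the
# first entries should be noticeably slower than the later ones.
runs = timeit.repeat(stand_in, number=10, repeat=5)
print(['%.4f' % t for t in runs])
```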

import math
import random
import datetime
import string

# To generate a random customer name for each fake order
LETTERS = string.ascii_lowercase


def crow_flies(lat1, lon1, lat2, lon2):
    ''' For MCVE. Haversine straight-line distance (km) between two
    locations. Should not affect run time.
    '''

    dx1, dy1 = math.radians(lat1), math.radians(lon1)
    dx2, dy2 = math.radians(lat2), math.radians(lon2)
    dlat, dlon = abs(dx2 - dx1), abs(dy2 - dy1)
    a = math.sin(dlat / 2) ** 2 + (math.cos(dx1) * math.cos(dx2)
        * math.sin(dlon / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    km = 6373 * c
    return km


def gen_matrix(order_list):
    ''' For MCVE. Returns dictionary of distance and time between all 
    location pairs. Should not affect run time.
    Returns {(location_id_from, location_id_to): [distance, time]}
    '''

    matrix = {}
    for loc in order_list:
        for pair_loc in order_list:
            if loc[0] != pair_loc[0]:
                distance = crow_flies(loc[1], loc[2], pair_loc[1], pair_loc[2])
                matrix[(loc[0], pair_loc[0])] = [distance, distance*random.random()]
    return matrix


def gen_jobs(num_jobs, start_time, end_time, lat_low, lat_high, lon_low, lon_high):
    ''' For MCVE. Creates random jobs with random time windows and locations.
    '''

    job_list = []
    for x in range(num_jobs):
        lat = random.uniform(lat_low, lat_high)
        lon = random.uniform(lon_low, lon_high)
        start = random.randrange(start_time, end_time-120, 1)
        end = start + 120
        faux_start = random.randrange(start, end-60, 1)
        faux_end = faux_start + 60
        capacity_demand = random.choice([-1, 1])
        name = ''.join(random.choice(LETTERS) for _ in range(5))
        job_list.append([name, lat, lon, start, end, faux_start, faux_end,
                        capacity_demand])
    return job_list


def calc_route_cost(start_time, route, matrix):
    ''' Returns the cost of each randomly generated route '''

    cost = 0

    # Mileage cost
    cost += sum(matrix[(route[x][0], route[x + 1][0])][0]
                for x in range(len(route) - 1)) * 0.14

    # Man-hour cost
    cost += sum(matrix[(route[x][0], route[x + 1][0])][1]
                for x in range(len(route) - 1)) * 0.35

    for x in range(0, len(route)-1):
        travel_time = matrix[(route[x][0], route[x+1][0])][1]
        arrival = start_time + travel_time
        start_time += travel_time
        departure = arrival + 10

        if arrival < route[x+1][3]:
            # Penalise early arrival
            arr_cost = (route[x+1][3] - arrival)**2
            cost += arr_cost

        elif departure > route[x+1][4]:
            # Penalise late departure
            dep_cost = (departure - route[x+1][4])**2
            cost += dep_cost

        if arrival < route[x+1][5]:
            # Penalise 'soft' early arrival i.e. earlier than a fake prediction 
            # of arrival
            faux_arr_cost = (route[x+1][5] - arrival)**1.2
            cost += faux_arr_cost 

        elif departure > route[x+1][6]:
            # Penalise 'soft' late departure
            faux_dep_cost = (departure - route[x+1][6])**1.2
            cost += faux_dep_cost

    return cost


def hill_climb(jobs, matrix, iterations):
    ''' Randomly generate routes and store route if cheaper than predecessor '''

    cost_tracking, iteration_track = [], []
    initial_cost = calc_route_cost(480, jobs, matrix)
    best_cost = initial_cost
    best_route = jobs[:]
    changed_route = jobs[:]

    for x in range(iterations):
        random.shuffle(changed_route)
        new_cost = calc_route_cost(480, changed_route, matrix)

        if new_cost < best_cost:
            best_route = changed_route[:]
            best_cost = new_cost
            cost_tracking.append(best_cost)
            iteration_track.append(x)

    return cost_tracking, iteration_track


if __name__ == '__main__':

    #random_jobs = gen_jobs(20, 480, 1080, 24, 25, 54, 55)

    random_jobs = [['lmizn', 24.63441343319078, 54.766698677134784, 501, 621, 558, 618, 1],
                    ['jwrmk', 24.45711393348282, 54.255786174435165, 782, 902, 782, 842, 1],
                    ['gbzqc', 24.967074991405035, 54.07326911656665, 682, 802, 687, 747, 1],
                    ['odriz', 24.54161147027789, 54.13774173532877, 562, 682, 607, 667, -1],
                    ['majfj', 24.213785557876257, 54.452603867220475, 681, 801, 731, 791, -1],
                    ['scybg', 24.936517492880274, 54.83786889438055, 645, 765, 662, 722, -1],
                    ['betow', 24.78072704532661, 54.99907581479066, 835, 955, 865, 925, -1],
                    ['jkhmp', 24.88461478479374, 54.42327833917202, 546, 666, 557, 617, -1],
                    ['wbpnq', 24.328080543462, 54.85565694610073, 933, 1053, 961, 1021, -1],
                    ['ezguc', 24.292203133848382, 54.65239508177714, 567, 687, 583, 643, -1],
                    ['nlbgh', 24.111932340385735, 54.895627940055995, 675, 795, 711, 771, -1],
                    ['rtmbc', 24.64381176454049, 54.739636798961044, 870, 990, 910, 970, 1],
                    ['znkah', 24.235361720889216, 54.699010081109854, 627, 747, 645, 705, -1],
                    ['yysai', 24.48931405352803, 54.37480185313546, 870, 990, 882, 942, -1],
                    ['mkmbk', 24.5628992946158, 54.219159859450926, 833, 953, 876, 936, -1],
                    ['wqygy', 24.035376675509728, 54.92994438408514, 693, 813, 704, 764, -1],
                    ['gzwwa', 24.476121543580852, 54.13822533413381, 854, 974, 879, 939, 1],
                    ['xuyov', 24.288078529689894, 54.81812092976614, 933, 1053, 935, 995, 1],
                    ['tulss', 24.841925420359246, 54.08156783033599, 670, 790, 684, 744, -1],
                    ['ptdng', 24.113767467325335, 54.9417036320267, 909, 1029, 941, 1001, 1]]

    matrix = gen_matrix(random_jobs)

    # Below is the loop that appears to accelerate

    for p in range(10):
        start = datetime.datetime.now()
        sim_ann = hill_climb(random_jobs, matrix, 10000)
        end = datetime.datetime.now()

        # Number of iterations against greedy cost
        iteration_count = zip(sim_ann[1], sim_ann[0])

        print 'RUNTIME:', str(end - start)        
        print 'SOLUTION CONVERGENCE:', str(iteration_count)
roganjosh
    Have you ever heard about *caching* and [CPU branching prediction](http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array)? It's **normal** that if you run the exact same code with the exact same data the CPU will be able to run it faster because it has the data in the cache memory and the branch predictor has kept track of which branches were true and will perform better predictions. This is somewhat less true in python because it is interpreted but it *still* has an impact. – Bakuriu Jul 24 '16 at 17:00
  • @Bakuriu I have actually read that answer in the past. However, I didn't think it would apply here because new routes are generated entirely at random with `random.shuffle()` in `hill_climb`. Surely then, there is nothing to build on in each iteration of the `hill_climb` algorithm? The only thing I can think of here that isn't random is the lookup of dictionary values for distance/time, and that shouldn't get quicker? Are you able to be more specific about what you think the CPU is optimising on in this problem? – roganjosh Jul 24 '16 at 17:07
  • @Bakuriu in other words, I don't know where in this approach the CPU could anticipate whether or not a solution would be better or worse than an existing solution. This is not a sorted list with a distinct turning point of an `if/else` case; solutions are generated at random. – roganjosh Jul 24 '16 at 17:25
  • I tried it (with Python 2.7.11) and all runs take about the same time (around 0.32 seconds). What happens if you warm up the CPU/process by doing some unrelated heavy calculations beforehand? – Stefan Pochmann Jul 24 '16 at 17:54
  • @StefanPochmann that's really interesting. I have 16gig ram here and i7-4790 processor. Run times start around 1.1 - 1.2 sec and drop to a min 0.5 sec every time across 10 iterations. I ran almost continually for 30 mins because I really thought it was just due to randomness and I wanted to be sure before posting. I'll go try hammer my CPU some more, but I'm shocked at that discrepancy! – roganjosh Jul 24 '16 at 17:58
  • Oh I'm just talking about a few seconds, not 30 minutes. About the time it takes your ten runs, since that apparently suffices. Just in case your CPU needs a moment to decide to go to full speed or in case you have other processes running so that your Python process doesn't get maximum resources initially and has to gradually prove itself to be worthy of it. (Not really sure this is how things work, though :-) – Stefan Pochmann Jul 24 '16 at 18:11
  • @StefanPochmann hardware must go some way to explain this, but I would have thought I'd have decent processing power even with the GIL. I'm shocked you solve this faster than me even once the strange acceleration kicks in, but I don't know enough. Any randomly-generated route should have random outcomes for the `if/elif` branches in `calc_route_cost` so I'm not satisfied that CPU branching optimisation is the only root cause unless someone can explain to me properly. And if it really _is_ the case, I wanna know how to exploit it because time is critical :) – roganjosh Jul 24 '16 at 18:16
  • Why shocked, maybe I have the better CPU? :-) Mine's an i7-6700 (with 32 GB, if that matters). Did you try what I suggested, and did it change anything? In case it wasn't clear, I just mean something like `sum(xrange(10**8))` right in front of the `p`-loop. – Stefan Pochmann Jul 24 '16 at 18:24
  • @StefanPochmann Ok, that's taught me something I didn't know before. 0.518 --> 0.316 sec. So the idea here would be to keep the CPU under constant stress in a concurrent script somewhere (because calling for a large calculation before a solution does nothing to bring down the overall runtime of the solution itself)? I've clearly missed something in my self-teaching; if you have an approach you use consistently, please do write it up and I will mark it correct. Thank you for that. – roganjosh Jul 24 '16 at 18:32
  • @StefanPochmann In other words, I could keep a core busy on the server doing some random calculation that tells the CPU it's "business time" and accelerate all other threads going through the other cores. It's a hardware limitation but not specifically CPU branching prediction. I just need the hardware primed to solve more complex problems. – roganjosh Jul 24 '16 at 18:53
  • You mean you added that `sum(xrange(10**8))` and those are your times now, 0.518 for the first run and 0.316 for the tenth? Not sure about the concurrent script. If you run it in a separate process and if the reason was that your script's process needs to fight its way to maximum CPU power, then I'd expect that to not help. Because it's a different process. You could try having your script doing "random" work until a job arrives to be solved, and then stop the random work while solving the job - all in the same process. (Not sure why you keep bringing up branching prediction. I'm not Bakuriu.) – Stefan Pochmann Jul 24 '16 at 18:59
  • @StefanPochmann fair. Drop branching prediction, I know it's not the cause now and I've moved away from that; sorry if it appears to run through my comments. You are bang-on that the processor needs priming to solve the problem; I assume that this is for electricity-conservation that CPU doesn't ramp-up by default. Yes, `sum(xrange(10**8))` causes the subsequent loop to smash benchmarks on the routing problem here. But calculating that before each loop is as bad as a slow loop. I need a way of keeping CPU on maximum power. I want to keep the CPU primed for the solution. – roganjosh Jul 24 '16 at 19:06
  • Especially now that you said "server", I suspect that your system is simply busy. Meaning the hardware is already warmed up (I think CPU speed is changed very quickly anyway) and it's just that your OS initially doesn't give your script/process high priority and thus it needs to go work for it. Or maybe there's a way to tell the OS to right away give the script/process high priority. Don't know about that, and depends on your system. – Stefan Pochmann Jul 24 '16 at 19:07
  • @StefanPochmann please write the `sum(xrange(10**8))` before `n` as a solution so I can accept and close. I was mistaken in thinking this was a programming problem but it is a hardware optimisation issue. Google gave me nothing on this. Thank you for your help! Where I take this is my concern but you've solved the immediate curiosity/confusion of the question. – roganjosh Jul 24 '16 at 21:39
  • I'm glad you wrote an answer yourself and added that test and its results, I wouldn't have felt comfortable writing one because I was still uncertain about things and because I could've just reported stuff *you* had done. – Stefan Pochmann Jul 25 '16 at 18:33
  • @StefanPochmann I wanted to give you opportunity to do so. I never thought Windows had such overwhelming control over this kind of thing. Honestly, I could run the script in an infinite loop and nothing speeds up, it's `sum(xrange(10**8))` running _concurrently_ that does it (even better than running before hill-climb loop). I cannot accept my own answer yet and I will still accept yours should you post, I just set things in motion to close it myself if you chose not to reply. – roganjosh Jul 25 '16 at 18:37
  • @StefanPochmann essentially you saved me a hell of a lot of time debugging this :) I was convinced it was somehow in the code but I couldn't see where anything could be optimised (each branch of `if/elif` should be equally likely). If you don't take the answer then I still want to thank you as I was totally lost! – roganjosh Jul 25 '16 at 18:50
  • It's fine, your answer is good. And accepting it should get you off that devilish reputation :-). Very interesting that a concurrent script has an effect that big, I wouldn't have expected that, so I learned something as well. Btw, I have Windows 10, maybe that's also better at this? High time to upgrade, only four days left :-D – Stefan Pochmann Jul 25 '16 at 19:13

1 Answer


This issue is related to how the OS/hardware allocates CPU resources to the process, not to CPU branch prediction. On Windows 7, running another script concurrently that simply does:

for x in range(10):
    a = sum(xrange(10**8))

causes a massive acceleration in the execution of the script in this question; each iteration of `for p in range(10):` takes only ~25% of the time it takes when the PC is otherwise idle, and the runtime is also consistent across loops.

On Windows 7 the issue can be solved completely by following "Setting the Power Profile" as in this link, i.e. by switching the power plan to "High Performance".

I'm not clear why sum(xrange(10**8)) triggers the release of CPU resources so quickly, whilst 100,000 hill-climb iterations (across the 10 runs) never get the CPU fully up to speed, even though the total runtime is similar and the hill-climb is more complex. That's a very slow drip-feed.

It seems impossible to benchmark algorithms reliably on Windows 7 without switching to High Performance. Running the hill-climb script here in an infinite loop shows the same per-iteration pattern unless Windows is stopped from throttling the CPU; the allocated CPU resources appear to reset each time the script is called.

roganjosh