0

I recently wrote some code, and it is creating the outputs I want, however it is taking an eternity... I have 700k customer lines to iterate through for 435 warehouses, and to run 100k took 3 hours.

I know this question may be vague, but im not sure why it is running so slowly. I am suspecting it is due to my nested for loops, but the data won't load in unless I chunk it.

import pandas as pd
import geopy.distance
dfware = pd.read_csv('dfware.csv', encoding = "ISO-8859-1")

dfcust = pd.read_csv(r'dfcust.csv', encoding = "ISO-8859-1")
ppmwinarray = []
#Chunk size to load in
csize=10 ** 3

Bigoutput = []

y=0
for dfcust in pd.read_csv(r'dfcust.csv', encoding = "ISO-8859-1", chunksize = csize):
    #For all columns (index) and rows, in datacust, iterate through those rows.
    y+=1
    print(y)
    ppmwinarray = []
    z=0
    for index,row in dfcust.iterrows():
        #Assign the Lattitude variable in the row titled 'lat' to lat1
        lat1 = row['Lat']
        # Assign the longitude variable in the row titled 'Long' to lon1
        lon1 = row['Lon']
        dlist=[]
        dindex=[]
        print(z)
        z+=1

        for index2, row2 in dfware.iterrows():
            y+=1
            lat2 = row2['Lat']
            lon2 = row2['Lon']
            coords_1 = [lat1, lon1]
            coords_2 = [lat2, lon2]
            distance = geopy.distance.distance(coords_1, coords_2).miles
            if distance > 300:
                distance = 0
            else:
                distance = distance
            dlist.append(distance)
            d_i = ((300-distance)/300)
            if d_i != 1:
                d_i=d_i
            else:
                d_i=0
                dindex.append(d_i)
            sumdi = sum(dindex)
            if sumdi == 0:
                sumdi = 1
            #Defining last 13 as PPM Index
            ppmdindex =(dindex[-13:])
            #Calculating their independent win chance
            IndependentWinChance=[ x/ sumdi for x in ppmdindex]
            #Store in an array
            ppmarray = IndependentWinChance
            #Summing independent chances to get sum chance
            sumppmWinChance = sum(ppmarray)
            #Appending the sum of all distance indexes
            ppmarray.append(sumdi)
            #Appending the sum of ppm win chance
            ppmarray.append(sumppmWinChance)
            ppmwinarray.append(ppmarray)
    Bigoutput.extend(ppmwinarray)


Bigoutputdf = pd.DataFrame(Bigoutput)
Bigoutputdf.to_csv('customers1234.csv')
exit()
Chandella07
  • 2,089
  • 14
  • 22
  • 6
    You should start by profiling the program (https://docs.python.org/3/library/profile.html) to see what parts are taking the longest – jdrd Nov 28 '18 at 02:18
  • While @jdrd 's way is unequivocally the correct and optimal way to find the answers you're looking for, if you're looking for a quick and simple way to time out your functions you can wrap them in something like this: https://stackoverflow.com/questions/1465146/how-do-you-determine-a-processing-time-in-python/14739514 – LearnWorkLearn Nov 28 '18 at 03:05
  • Would also recommend checking out JHuw's answer here: https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe Integrating a spatial index or hashing function will greatly improve your speed. – LearnWorkLearn Nov 28 '18 at 03:20
  • First of all, you shouldn't have to loop pandas dataframe like this. Pandas is optimize to do operation on all the dataframe at once. Second, there is a great scipy package to find distance between coordinates, it's cKDTree of scipy.spatial [link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.query.html#scipy.spatial.cKDTree.query). I can find the nearest neighbor and the distance between two datasets with millions of rows in few minutes with it. – Akarius Nov 28 '18 at 04:46

0 Answers0