The code below works: it calculates the distance between the coordinates of the city where a public transport trip starts and the coordinates of the city where it ends. There is a limited number of unique from-city/to-city combinations, and many of them are repeated. The problem is that I have a large dataset of around 1.2 million records, and the code is rather slow because it recalculates the distance for every row. How can I rearrange the loop so that it calculates the distance once for each unique combination and reuses that value for the repeated combinations? Is there any approach that takes less processing time?

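from geopy.distance import geodesic

df_distance = []
for _, row in clean_df.iterrows():
    try:
        # (latitude, longitude) of the trip's origin and destination
        coords_1 = (row.Lat_x, row.Lng_x)
        coords_2 = (row.Lat_y, row.Lng_y)
        df_distance.append(geodesic(coords_1, coords_2).km)
    except ValueError:
        # Invalid coordinates (e.g. latitude outside [-90, 90]): the row is
        # printed and skipped, so df_distance ends up shorter than clean_df
        print(row)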
Elda

1 Answer

I rewrote the loop, which shortened the processing time of the distance calculations on my dataset. I create an empty dictionary that stores the calculated distance for each unique origin-destination combination. Each combination gets a unique code, built by concatenating the origin and destination municipality codes as strings, and that code is the dictionary key. If the code has been encountered before (a repeated combination), the stored distance is read from the dictionary; otherwise the distance is calculated and added to the dictionary.

from geopy.distance import geodesic

distance_dict = {}  # cache: origin-destination key -> distance in km
df_distance = []
for _, row in clean_df.iterrows():
    try:
        # The separator keeps code pairs such as (1, 23) and (12, 3) from sharing a key
        uniquecode = f"{row.from_municipality_code}_{row.to_municipality_code}"
        if uniquecode not in distance_dict:
            # First occurrence of this pair: compute the distance once and store it
            coords_1 = (row.Lat_x, row.Lng_x)
            coords_2 = (row.Lat_y, row.Lng_y)
            distance_dict[uniquecode] = geodesic(coords_1, coords_2).km
        # Repeated pairs reuse the cached value
        df_distance.append(distance_dict[uniquecode])
    except ValueError:
        # Invalid coordinates: the row is printed and skipped
        print(row)
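
The same idea (compute each unique combination once, then reuse it) can probably also be expressed without an explicit row loop over the full dataset, by deduplicating the coordinate columns and merging the result back. The following is only a sketch under some assumptions: the column names Lat_x, Lng_x, Lat_y, Lng_y are taken from the question, the helper names `haversine_km` and `add_distance_column` and the output column `distance_km` are made up for illustration, and a plain haversine formula stands in for geopy's `geodesic` so the example has no dependency beyond pandas. Haversine assumes a spherical Earth, so its results differ slightly from `geodesic`.

```python
import math

import pandas as pd


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km on a sphere (stand-in for geopy's geodesic)."""
    r = 6371.0088  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def add_distance_column(df, distance_fn=haversine_km):
    """Hypothetical helper: compute one distance per unique coordinate pair,
    then merge it back onto every row that shares that pair."""
    coord_cols = ["Lat_x", "Lng_x", "Lat_y", "Lng_y"]
    # One row per distinct origin-destination coordinate combination
    unique_pairs = df[coord_cols].drop_duplicates().copy()
    unique_pairs["distance_km"] = [
        distance_fn(*coords)
        for coords in unique_pairs.itertuples(index=False, name=None)
    ]
    # A left merge keeps every original row, in its original order
    return df.merge(unique_pairs, on=coord_cols, how="left")


# Toy data: two identical trips and one trip in the reverse direction
clean_df = pd.DataFrame({
    "Lat_x": [52.37, 52.37, 51.92],
    "Lng_x": [4.90, 4.90, 4.48],
    "Lat_y": [51.92, 51.92, 52.37],
    "Lng_y": [4.48, 4.48, 4.90],
})
result = add_distance_column(clean_df)
print(result)
```

To keep geodesic distances, something like `add_distance_column(clean_df, lambda a, b, c, d: geodesic((a, b), (c, d)).km)` should work, assuming geopy is installed and `geodesic` is imported from `geopy.distance`. Keying on the coordinates rather than on the municipality codes avoids building a string key at all.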
Elda