
I have the following code, which takes a very long time to execute. The pandas DataFrames df and df_plants are very small (less than 1 MB). Is there any way to optimise this code?

import pandas as pd
import geopy.distance
import re

def is_inside_radius(latitude, longitude, df_plants, radius):
    if (latitude != None and longitude != None):
        lat = float(re.sub("[a-zA-Z]", "", str(latitude)))
        lon = float(re.sub("[a-zA-Z]", "", str(longitude)))
        for index, row in df_plants.iterrows():
            coords_1 = (lat, lon)
            coords_2 = (row["latitude"], row["longitude"])
            dist = geopy.distance.distance(coords_1, coords_2).km
            if dist <= radius:
                return 1
    return 0

df["inside"] = df.apply(lambda row: is_inside_radius(row["latitude"],row["longitude"],df_plants,10), axis=1)

I use regex to process latitude and longitude in df because the values contain some errors (characters) which should be deleted.

The function is_inside_radius checks whether the point (row["latitude"], row["longitude"]) lies within a radius of 10 km of any of the points in df_plants.

ScalaBoy
  • How large is `df_nuclear`? – John Gordon Oct 14 '18 at 21:27
  • @JohnGordon: It's very small, contains approx. 20,000 rows. It's around 500Kb. – ScalaBoy Oct 15 '18 at 07:52
  • @PaulMcG: Have you actually read my question or did you only read the title? – ScalaBoy Oct 15 '18 at 07:53
  • Reopened - I suggest you clarify your title, btw – PaulMcG Oct 15 '18 at 10:47
  • Where was `df_nuclear` defined? Should it be `df_plants`? You may apply the function to the column. See example from https://stackoverflow.com/questions/34962104/pandas-how-can-i-use-the-apply-function-for-a-single-column – yoonghm Oct 15 '18 at 11:52
  • You may apply the function to the column. See example from https://stackoverflow.com/questions/34962104/pandas-how-can-i-use-the-apply-function-for-a-single-column and store the result on a new column. `coords_1 = (lat, lon)` should be outside of the `for` loop so that it is only executed once. – yoonghm Oct 15 '18 at 12:00
  • @yoonghm: Yes, I changed it. Thanks. – ScalaBoy Oct 15 '18 at 12:01
  • Can you give some examples of `row["latitude"]` and `row["longitude"]`? – yoonghm Oct 15 '18 at 12:59
  • @yoonghm: For example, `42.002` and `2.0890` as `row["latitude"]` and `row["longitude"]`, respectively. – ScalaBoy Oct 15 '18 at 13:24

2 Answers


Can you try this?

import pandas as pd
from geopy import distance
import re

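def is_inside_radius(latitude, longitude, df_plants, radius):
  # Rows with missing coordinates are treated as "not inside"
  if latitude is not None and longitude is not None:
    # Strip stray letters from the raw values before converting to float
    lat = float(re.sub("[a-zA-Z]", "", str(latitude)))
    lon = float(re.sub("[a-zA-Z]", "", str(longitude)))
    coords_1 = (lat, lon)  # built once, outside the loop

    # itertuples() yields namedtuples, so fields are read as attributes
    for row in df_plants.itertuples():
      coords_2 = (row.latitude, row.longitude)
      if distance.distance(coords_1, coords_2).km <= radius:
        return 1
  return 0

# A row-wise call needs DataFrame.apply with axis=1
df["inside"] = df.apply(
                    lambda row: is_inside_radius(
                      row["latitude"],
                      row["longitude"],
                      df_plants,
                      10),
                    axis=1)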

According to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html#pandas-dataframe-iterrows, pandas.DataFrame.itertuples() returns namedtuples of the values, is generally faster than pandas.DataFrame.iterrows(), and preserves dtypes across the returned rows.
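As a small illustration of what itertuples() returns, here is a minimal sketch. The two-row df_plants below is made-up sample data (the first row reuses the example values from the comments), and index=False is my own choice to keep the tuples down to the two coordinate fields.

```python
import pandas as pd

# Hypothetical sample data; the real df_plants is not shown in the question
df_plants = pd.DataFrame({
    "latitude": [42.002, 48.5],
    "longitude": [2.089, 7.25],
})

# Each row comes back as a namedtuple whose fields are the column names
rows = list(df_plants.itertuples(index=False))
first = rows[0]
print(first.latitude, first.longitude)

# Namedtuples are not indexable by column name, unlike the Series
# objects that iterrows() yields
try:
    first["latitude"]
    string_index_works = True
except TypeError:
    string_index_works = False
```

This is why the loop has to read row.latitude rather than row["latitude"] once iterrows() is swapped for itertuples().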

yoonghm

I've encountered such a problem before, and I see one simple optimisation: try to avoid the floating point calculation as much as possible, which you can do as follows.
Imagine:
You have a circle, defined by Mx and My (centre coordinates) and R (radius).
You have a point, defined by its coordinates X and Y.

If your point (X,Y) is not even within the square defined by the centre (Mx, My) and side length 2*R, then it cannot be within the circle defined by (Mx, My) and radius R either. Note that X, Y and R must be expressed in the same units, so for latitude/longitude data the radius in km first has to be converted to degrees.
In pseudo-code:

function is_inside(X,Y,Mx,My,R):
  if (abs(Mx-X) > R) OR (abs(My-Y) > R)
  then return false   // outside the bounding square, so outside the circle
  else:
    // and only here you perform the floating point calculation
    return distance((X,Y), (Mx,My)) <= R
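The pseudo-code above treats the coordinates as planar. For latitude/longitude data, one possible way to sketch the same idea in Python is shown below. It rests on several assumptions of mine, not on anything in the question: a spherical Earth with a mean radius of 6371 km, a haversine distance swapped in for geopy's geodesic distance (so results can differ slightly from geopy), and the function names haversine_km and is_inside, which are hypothetical. The bounding box is derived from the angular radius so that it never rejects a point that is actually inside the circle.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius, spherical Earth model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_inside(lat, lon, center_lat, center_lon, radius_km):
    """Cheap bounding-box rejection first, exact distance only if needed."""
    r = radius_km / EARTH_RADIUS_KM  # angular radius in radians

    # Latitude margin: the circle spans at most r radians north and south
    if abs(lat - center_lat) > math.degrees(r):
        return False

    # Longitude margin: asin(sin(r) / cos(center_lat)); skipped near the
    # poles or for very large radii, where no finite margin exists
    cos_lat = math.cos(math.radians(center_lat))
    if r < math.pi / 2 and cos_lat > 0 and math.sin(r) / cos_lat < 1:
        lon_margin = math.degrees(math.asin(math.sin(r) / cos_lat))
        dlon = abs(lon - center_lon)
        dlon = min(dlon, 360 - dlon)  # handle wrap-around at +/-180 degrees
        if dlon > lon_margin:
            return False

    # Only points that survive the box test pay for the trigonometry
    return haversine_km(lat, lon, center_lat, center_lon) <= radius_km


# Usage with a hypothetical centre near the example values in the comments
print(is_inside(42.05, 2.0, 42.0, 2.0, 10))
print(is_inside(42.2, 2.0, 42.0, 2.0, 10))
```

Points in the corners of the box still go through the full distance check, so the pre-filter only saves work when most candidate points are far from the centre.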
Dominique