First let's review your solution: for each value in lon1 and for each value of lat1 (which is n^2 iterations in case they are in the size of n), you tried to filter the dataframe, which results in scanning the entire df: your code ran through the dataframe n^2 times, which is inefficient.
My solution requires to scan the dataframe only once, and for each scan it does n actions. It uses pandas apply
function which is not very efficient, but I could not find a way to do so without it. I would love to hear a solution to filter without using apply.
I used a small reproducible example, you may need to adapt the indices to match to your code. I believe this example is much easier to follow.
import pandas as pd
import numpy as np
df = pd.DataFrame({"lat":[22.5, 10.76, 7.341, 22.5], "log":[3.64, 7.234, 135, 3.644], "level":[2, 8, 19, 9]})
lat1 = np.array([22.51, 7.33])
lon1 = np.array([3.6, 135.02])
The next lines create a list of tuples, each consist a pandas.Interval
object. The tuple here is representing (lat1[i]+-x, lon1[i]+-x). Note to say, I did not have to use pandas.Interval - I could just build another tuple of (lat1[i]-x, lat1[i]+x). But I decided to go with pandas interval, does not really matter.
The result: for each pair of [lat1, lon1], we have a tuple of two pandas interval, each is +-0.125
interval_list = []
const_add = 0.125
for i, item in enumerate(lat1):
interval_list.append((pd.Interval(left=lat1[i]-const_add, right=lat1[i]+const_add),pd.Interval(left=lon1[i]-const_add, right=lon1[i]+const_add)))
Now we want to filter the dataframe. For using apply
, I created a custom function: it checks if current row is within a tuple, and if so it returns the index in lat1 array (you'll see later why is it useful)
def within_range(row, interval_list):
for i, item in enumerate(interval_list):
if row[0] in item[0] and row[1] in item[1]:
return i
return np.nan
df["point"] = df.apply(lambda x: within_range(x, interval_list), axis=1)
At that point of the code, we have a column name 'point'. The value of it is as follows: if the row is close to point i (where i is the index in lat1[i] and lon1[i]), the value is i. If it has no close point, the value is nan.
Now all that is left is to just find the maximum and minimum for each point, which can be achieve easily using groupby
:
max_series = df.groupby(by="point")["level"].max()
min_series = df.groupby(by="point")["level"].min()
You have two series where the index is same as the index in lat1 and lon[1]. You can convert them easily to array by using Series.array
.
It's worth to mention that you did not say how to handle missing values - if no point in df is close to point (lat1[50], lon1[50]), what is the value in the maximum and minimum array? That's why I leave it as a series, I believe it's easier to manipulate it before changing it to array.
The whole code together:
import pandas as pd
import numpy as np
df = pd.DataFrame({"lat":[22.5, 10.76, 7.341, 22.5], "log":[3.64, 7.234, 135, 3.644], "level":[2, 8, 19, 9]})
lat1 = np.array([22.51, 7.33])
lon1 = np.array([3.6, 135.02])
interval_list = []
const_add = 0.125
for i, item in enumerate(lat1):
interval_list.append((pd.Interval(left=lat1[i]-const_add, right=lat1[i]+const_add),pd.Interval(left=lon1[i]-const_add, right=lon1[i]+const_add)))
def within_range(row, interval_list):
for i, item in enumerate(interval_list):
if row[0] in item[0] and row[1] in item[1]:
return i
return np.nan
df["point"] = df.apply(lambda x: within_range(x, interval_list), axis=1)
max_arr = df.groupby(by="point")["level"].max()
min_arr = df.groupby(by="point")["level"].min()
# or:
# max_arr = df.groupby(by="point")["level"].max().array