
I've got a DataFrame with two columns that are the coordinates of a point. I need to populate a column (currently full of None) with a specific value if that point falls within a particular region. The regions and their labels are stored in another DataFrame.

It is not easy to explain, but I hope that with an example it will be clear. DF1:

   latitude  longitude  LABEL
0    1.3       2.7      None
1    3.5       3.6      None
2    2.8       3.0      None
3    9.7       1.9      None
4    6.2       5.7      None
5    1.7       3.4      None
6    3.5       1.4      None
7    2.7       6.6      None
8    1.7       2.7      None
9    1.3       1.3      None

DF2

   minlat     maxlat    minlong   maxlong  STRING
0    1.0       2.0        1.0       3.0     AAA
1    3.0       4.0        1.0       2.0     BBB
2    3.0       4.0        3.0       4.0     CCC
3    5.0       7.0        2.0       3.0     DDD

The final result is:

   latitude  longitude  LABEL
0    1.3       2.7      AAA
1    3.5       3.6      CCC
2    2.8       3.0      None
3    9.7       1.9      None
4    6.2       5.7      None
5    1.7       3.4      None
6    3.5       1.4      BBB
7    2.7       6.6      None
8    1.7       2.7      AAA
9    1.3       1.3      AAA

The code I have for now is:

for i in range(len(DF2)-1):
    DF1.loc[(DF1['latitude'] >= DF2.loc[i:i, 'minlat'].at[i]) & (DF1['latitude'] < DF2.loc[i:i, 'maxlat'].at[i]) &
            (DF1['longitude'] >= DF2.loc[i:i, 'minlong'].at[i]) & (DF1['longitude'] < DF2.loc[i:i, 'maxlong'].at[i]), 'LABEL'] = DF2.loc[i:i, 'STRING'].at[i]


So, for each row of DF2, I check whether the values of DF1 fall within the bounds, and if so I assign the string.

But like this it takes a lot of time. Do you have some advice on what I can do? My problem is that each row of DF1 must be checked against every row of DF2, not only against the one with the same index.

EDIT: I'm trying other approaches:

2)

for i in range(len(DF2)-1):
    minLat = DF2.loc[i:i, 'minlat'].at[i]
    maxLat = DF2.loc[i:i, 'maxlat'].at[i]
    minLong = DF2.loc[i:i, 'minlong'].at[i]
    maxLong = DF2.loc[i:i, 'maxlong'].at[i]
    DF1.loc[(DF1['latitude'] >= minLat) & (DF1['latitude'] < maxLat) &
            (DF1['longitude'] >= minLong) & (DF1['longitude'] < maxLong), 'LABEL'] = DF2.loc[i:i, 'STRING'].at[i]

That takes less time locally but more when I try it on the machine.

and

for i in range(len(DF2)-1):
    minLat = DF2.loc[i:i, 'minlat'].at[i]
    maxLat = DF2.loc[i:i, 'maxlat'].at[i]
    minLong = DF2.loc[i:i, 'minlong'].at[i]
    maxLong = DF2.loc[i:i, 'maxlong'].at[i]
    # note: the None default overwrites labels assigned in earlier iterations
    DF1 = DF1.assign(
        LABEL=np.select(
            [(DF1['latitude'] >= minLat) & (DF1['latitude'] < maxLat) & (DF1['longitude'] >= minLong) & (DF1['longitude'] < maxLong)],
            [DF2.loc[i:i, 'STRING'].at[i]],
            None))

That takes more time locally but less on the machine.

  • You can create a matrix and try to see if there is some reduction you can do in a vectorized fashion, or you can use Numba to accelerate the explicit looping – norok2 Aug 23 '21 at 16:39
  • Are the regions defined in DF2 mutually exclusive? If not, what is the resolution strategy? Can you provide some [mcve]? Also, what is the size of the dataframes in production? – norok2 Aug 25 '21 at 12:25
  • Generally it's quite slow to iterate through a pandas dataframe in a for loop. Does this help at all? https://stackoverflow.com/a/19913845/14463396 – Emi OB Aug 25 '21 at 12:30
  • What's supposed to happen if a point in `df1` belongs to several `df2` bounding boxes? – Pierre D Aug 25 '21 at 12:45
  • Note that your problem would be considerably simpler if the bounding boxes were circular or square, i.e., expressed by a center, a "radius" and a _p_-norm (1 for square, 2 for circular). – Pierre D Aug 25 '21 at 13:31
  • @PierreD How so? I would love an answer in that setting, although it doesn't fit the question perfectly. – Stef Aug 25 '21 at 13:36
  • @Stef: If all the boxes are the same shape (circular or square, or even rectangle) and size (same "radius"), then you can directly use [`scipy.spatial.KDTree`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html), which is highly optimized for this kind of problem. If the radii are all different but the shapes are the same, it's a touch more complex, but you can still find solutions with relatively low computational complexity. – Pierre D Aug 25 '21 at 15:03
  • @Stef FYI, I added an answer that uses `KDTree` [below](https://stackoverflow.com/a/68927974/758174). – Pierre D Aug 26 '21 at 03:14

3 Answers


Here is another answer, this one using scipy.spatial.KDTree. It is particularly efficient in the case where there aren't too many overlaps between the bounding boxes, and their "radius" distribution isn't too "wild" (radius.max() / np.median(radius) not too large, e.g. between 1 and 2).

It works both for p-norm 1 (Manhattan) or 2 (Euclidean), although in practice p=2 is faster, because on average each point sees fewer candidates (the sum area of the circles is less than that of the diamonds).

Why is it fast? KD-trees are well suited for this kind of problem. They partition the space by splitting it at each node along a dimension and a mid-point. Once they are built, querying them is fast because of the divide-and-conquer approach they offer.
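
As a tiny illustration (toy data, not from this answer) of the query primitive used below: query_ball_tree returns, for each point of one tree, the indices of the other tree's points within a given distance.

import numpy as np
from scipy.spatial import KDTree

pts = KDTree(np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]))
ctr = KDTree(np.array([[0.5, 0.5], [5.0, 5.0]]))
# for each point in `pts`, the indices of `ctr` entries within Euclidean distance 1
print(pts.query_ball_tree(ctr, r=1.0))  # [[0], [0], [1]]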

The key function is the following:

def find(kd_boxes, kd_points, bb, r, p):
    # bb: m bounding boxes, in the form [x0, x1, y0, y1]
    # find one bb for each point (the first one), or m if none
    xy = kd_points.data
    found = np.full(len(xy), len(bb), dtype=int)  # len(bb) means "no match"
    # candidate boxes per point, padded to a rectangular array with -1
    cand = pad_jagged(kd_points.query_ball_tree(kd_boxes, r * 1.001, p=p))

    remaining = None
    for c in range(cand.shape[1]):
        # keep only the points that still have a candidate in column c
        if remaining is None:
            remaining = np.nonzero(cand[:, c] >= 0)[0]
        else:
            remaining = remaining[np.nonzero(cand[remaining, c] >= 0)[0]]
        if remaining.size == 0:
            break
        i = remaining
        j = cand[i, c]
        # exact containment test for these candidate boxes
        ok = (bb[j, 0::2] <= xy[i]).all(1) & (xy[i] <= bb[j, 1::2]).all(1)
        k = np.nonzero(ok)[0]
        found[i[k]] = j[k]
        remaining = remaining[~ok]
    return found

Before calling this function, we compute a "radius" for each bounding box that is half of the p-norm of its diagonal. We then use the overall max radius r as a maximum distance for the KDTree queries. The KDTree query (kd_points.query_ball_tree()) efficiently sifts through all the bounding box centers, and finds in one call all the ones that are within that radius. This is a superset of the actual matches, but it is fast and substantially reduces the search space. After that, it is a matter of filtering the points that actually are in the bounding box, and keeping track of the (first) matching bounding box for each point.

As an optimization (not implemented here), we could consider the sizes of the bounding boxes (the radius array). If judged too disparate, the bounding boxes could be split into two groups (e.g. around np.median(radius)), and the same search could be done for each half (and again, recursively, if necessary).
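
A rough sketch of that splitting idea, for illustration only (it reuses find() above; when boxes overlap, the box picked for a point may differ from the single-pass version, because the small-radius group is searched first):

import numpy as np
from scipy.spatial import KDTree

def find_regions_split(xy, bb, p=2):
    # hypothetical sketch of the "split by radius" optimization described above
    centers = bb @ np.array([[1, 1, 0, 0], [0, 0, 1, 1]]).T / 2
    radius = (bb @ np.array([-1, 1, -1, 1]) / 2 if p == 1
              else np.linalg.norm(bb @ np.array([[-1, 1, 0, 0], [0, 0, -1, 1]]).T, axis=1) / 2)
    order = np.argsort(radius)
    groups = (order[:len(order) // 2], order[len(order) // 2:])  # small boxes, large boxes
    found = np.full(len(xy), len(bb), dtype=int)                 # len(bb) means "no match"
    kd_points = KDTree(xy)
    for group in groups:
        if group.size == 0:
            continue
        # search this group only, with its own (smaller) max radius
        sub = find(KDTree(centers[group]), kd_points, bb[group], radius[group].max(), p=p)
        hit = (sub < len(group)) & (found == len(bb))            # matched here and not matched before
        found[hit] = group[sub[hit]]                             # map group-local indices back to df2 rows
    return found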

For the OP example, after a bit of preparation to get centers, bounding boxes, and radius in a form easier to use (all Numpy arrays), that query returns:

# preparation
import numpy as np
from scipy.spatial import KDTree

xy = df1[['longitude', 'latitude']].values
bb = df2[['minlong', 'maxlong', 'minlat', 'maxlat']].values
centers = bb @ np.array([[1,1,0,0], [0,0,1,1]]).T / 2  # box centers
radius = np.linalg.norm(bb @ np.array([[-1,1,0,0], [0,0,-1,1]]).T, axis=1) / 2  # half-diagonal (2-norm)
kd_boxes = KDTree(centers)
kd_points = KDTree(xy)

# query
>>> kd_points.query_ball_tree(kd_boxes, radius.max() * 1.001)
[[0], [2], [2], [], [], [], [1], [], [0], [0]]

A similar result (in this example, identical) can be obtained with Manhattan norm instead:

radius = bb @ np.array([-1,1,-1,1]) / 2
>>> kd_points.query_ball_tree(kd_boxes, radius.max() * 1.001, p=1)
[[0], [2], [2], [], [], [], [1], [], [0], [0]]

Both these solutions are plotted in a figure (not reproduced here) in which all candidate bounding boxes are highlighted, along with the actual area within distance r of their center.

Note how, in both cases, point #2 is wrongly included in bbox #2. This first step, using KD-Trees, is just a filtering one. The next step is to check that, for each point, the candidates actually contain the point. With that filtering (the expression for the ok array), we obtain the final assignment.

Let's see what happens if we have more points and bounding boxes, possibly with overlaps:

def gen_example(n, m, seed=None):
    # let's make a more complex problem, with overlaps
    if seed is not None:
        np.random.seed(seed)
    df1 = pd.DataFrame({
        'latitude': np.random.uniform(0, 1 + 1 / np.sqrt(m), size=n),
        'longitude': np.random.uniform(0, 1 + 1 / np.sqrt(m), size=n),
    })

    df2 = pd.DataFrame({
        'minlat': np.random.uniform(size=m),
        'maxlat': np.random.uniform(.5, 1, size=m) / np.sqrt(m),
        'minlong': np.random.uniform(size=m),
        'maxlong': np.random.uniform(.5, 1, size=m) / np.sqrt(m),
        'STRING': [f'b_{i}' for i in range(m)],
    })
    df2['maxlat'] += df2['minlat']
    df2['maxlong'] += df2['minlong']
    return df1, df2

For df1, df2 = gen_example(32, 12, 0) (pictures not reproduced here), the candidates (the result of the query, here for p=2) are:

>>> kd_cand = kd_points.query_ball_tree(kd_boxes, r * 1.001, p=2)
[[], [], [9], [], [], [10], [], [], [], [], [], [9], [10], [], [11], [11], [], [2, 6], [8], [8], [], [4], [7, 9], [4], [3, 5], [2, 4], [0], [], [7, 9], [7, 9], [0, 3, 5], [4]]

Since this is a jagged array, we turn it into a rectangular array with fill_value=-1:

def pad_jagged(a, fill_value=-1):
    # turn a list of lists into a rectangular array, padding each row
    # on the right with fill_value
    lens = np.array([len(r) for r in a])
    v = np.array([e for r in a for e in r])
    w = np.max(lens)
    # insert (w - len) padding values at the end of each row of the flattened array
    return np.insert(
        v,
        np.repeat(np.cumsum(lens), w - lens),
        fill_value
    ).reshape(-1, w)

This gives, for the candidate list above:

>>> pad_jagged(kd_cand)
[[-1 -1 -1]
 [ 8 -1 -1]
 [ 9 -1 -1]
 ...
 [ 7  9 -1]
 [ 0  3  5]
 [ 4 -1 -1]]

Now, we iterate through the columns of this array, but in a greedy way that removes any successful matches from previous iterations (that's the reason for remaining).

Other functions to handle preprocessing etc. are:

def find_regions(xy, bb, p=2):
    # xy: numpy array (n, 2)
    # bb: numpy array (m, 4): the four columns are xmin, xmax, ymin, ymax
    # for each point in xy, return the index of a region that contains it, or m if none
    centers = bb @ np.array([[1,1,0,0], [0,0,1,1]]).T / 2
    assert p in {1, 2}, "only 1- or 2-norm supported"
    # "radius" of each box: the p-norm of its half-diagonal
    radius = bb @ np.array([-1,1,-1,1]) / 2 if p == 1 else np.linalg.norm(bb @ np.array([[-1,1,0,0], [0,0,-1,1]]).T, axis=1) / 2
    kd_boxes = KDTree(centers)
    kd_points = KDTree(xy)
    return find(kd_boxes, kd_points, bb, radius.max(), p=p)

def find_region_label(df1, df2, nomatch=None, p=2):
    found = find_regions(
        df1[['longitude', 'latitude']].values,
        df2[['minlong', 'maxlong', 'minlat', 'maxlat']].values,
        p=p,
    )
    # append `nomatch` so that the "no region found" index (len(df2)) maps to it
    lbl = np.r_[df2['STRING'].values, [nomatch]]
    return lbl[found]

On the OP example:

>>> df1.assign(LABEL=find_region_label(df1, df2))
   latitude  longitude LABEL
0       1.3        2.7   AAA
1       3.5        3.6   CCC
2       2.8        3.0  None
3       9.7        1.9  None
4       6.2        5.7  None
5       1.7        3.4  None
6       3.5        1.4   BBB
7       2.7        6.6  None
8       1.7        2.7   AAA
9       1.3        1.3   AAA

Speed test

And now for a speed test where both the number of points n and the number of bounding boxes m are larger:

df1, df2 = gen_example(100_000, 10_000, 0)

Speed comparison with @norok2's numba solution (slightly adapted to follow the OP's column names):

a = %timeit -o find_region_label(df1, df2)
# 222 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

b = %timeit -o locate_in_regions_nb(df1, df2)
# 5.38 s ± 40.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> b.average / a.average
24.255

That's a 24x speedup on this data.

Verify that we get the same solution:

sa = find_region_label(df1, df2)
sb = locate_in_regions_nb(df1, df2)
>>> (sa == sb).all()
True
Pierre D
  • Awesome answer! Thanks for all the details. – Stef Aug 26 '21 at 08:37
  • *"It works both for p-norm 1 (Manhattan) or 2 (Euclidean), although in practice p=2 is faster, because on average each point sees fewer candidates (the sum area of the circles is less than that of the diamonds)."* Do I understand correctly that if we chose two different radii r1 and r2 such that the area of a 1-norm-ball of radius r1 is equal to the area of a 2-norm-ball of radius r2, and implemented the method with 1-norm with radius r1 and with 2-norm with radius r2, we wouldn't expect one to be faster than the other, right? – Stef Aug 26 '21 at 08:44
  • Roughly yes, but with two caveats: 1. for a majority of area shapes, including the rectangular bounding boxes of the question, the area of the minimal 1-norm ball that fully encloses a shape is always greater than the area of the minimal 2-norm ball that fully encloses the same shape. The main reason why this affects our speed a bit is that we have on average more false positive candidates with the 1-norm than with the 2-norm. Exceptions to this include, for example, the case of "diamond"-shaped bounding boxes (squares at 45 degrees away from the axes), because those are already 1-norm balls. – Pierre D Aug 26 '21 at 11:30
  • (contd) 2. when going down a KDTree during a search for neighbors within a distance, there are nodes where you need to follow both sub-trees: if the point you are querying is close enough to the splitting hyperplane at that node. Even with `r1` and `r2` chosen such that the respective 1- and 2-norm balls have the same area, the extent along the axes of those areas is greater for the 1-ball, so the expected frequency of having to follow both branches at a split would be slightly higher. – Pierre D Aug 26 '21 at 11:35

One solution to vectorize this operation is to use Numpy and its wonderful broadcasting abilities. This gives a fast solution for small and medium-sized DataFrames, but it grows (both in time and memory used for the mask) as O[n*m] (for n rows of df1 and m rows of df2), so eventually that becomes slow for large DataFrames.

import numpy as np

a = df1[['latitude', 'longitude']].values
vmin = df2[['minlat', 'minlong']].values
vmax = df2[['maxlat', 'maxlong']].values
# mask[i, j] is True iff point i lies inside bounding box j
mask = (vmin[None, :, :] <= a[:, None, :]).all(2) & (a[:, None, :] <= vmax[None, :, :]).all(2)
has_any = mask.any(1)        # points that fall in at least one box
first = mask.argmax(axis=1)  # first matching box (meaningless where has_any is False)
label = np.full(len(df1), None, dtype=object)
label[has_any] = df2['STRING'].values[first[has_any]]

>>> df1.assign(LABEL=label)
   latitude  longitude LABEL
0       1.3        2.7   AAA
1       3.5        3.6   CCC
2       2.8        3.0  None
3       9.7        1.9  None
4       6.2        5.7  None
5       1.7        3.4  None
6       3.5        1.4   BBB
7       2.7        6.6  None
8       1.7        2.7   AAA
9       1.3        1.3   AAA

Explanation

The key part is the construction of the mask. It's worth breaking it down to understand the mechanism and how it uses Numpy's broadcasting:

>>> vmin[None, :, :] <= a[:, None, :]
[[[ True  True]
  [False  True]
  [False False]
  [False  True]]

 [[ True  True]
  [ True  True]
  [ True  True]
  [False  True]]
 ...
 [[ True  True]
  [False  True]
  [False False]
  [False False]]]

As you can see, the above expands all the comparisons between a and vmin into a 3rd dimension. We then project back to 2D with the logic "both entries along the 3rd axis (latitude and longitude) have to be True":

>>> (vmin[None, :, :] <= a[:, None, :]).all(2)
[[ True False False False]
 [ True  True  True False]
 [ True False False False]
 [ True  True False False]
 [ True  True  True  True]
 [ True False False False]
 [ True  True False False]
 [ True False False False]
 [ True False False False]
 [ True False False False]]

The above indicates, as entry [i, j], whether the point df1.iloc[i] is above the minimums of df2.iloc[j].

We do the same for vmax, and the resulting mask is True exactly where the point df1.iloc[i] is inside the bounding box of df2.iloc[j].

The next two bits are has_any and first. The former indicates which points in df1 fall in at least one bounding box. The latter is the first such bounding box (as index in df2).
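
As a tiny illustration (made-up values, not from the answer) of why has_any is needed: argmax returns 0 for a row with no True at all, which would otherwise look like a match with the first bounding box.

import numpy as np

m = np.array([[False, True],    # point 0 is in box 1
              [False, False]])  # point 1 is in no box
print(m.argmax(axis=1))  # [1 0]  -> the 0 for point 1 is not a real match
print(m.any(axis=1))     # [ True False]  -> used to mask out such rows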

The rest is pretty self-explanatory.

Notes

Be aware that this uses O[n*m] comparisons (for n rows of df1 and m rows of df2), which might be too slow for large matrices (although, because it is vectorized, it goes very fast for medium-sized matrices).
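
For a rough sense of the memory cost, assuming the sizes used in the speed test of the KDTree answer above and NumPy's one byte per boolean:

# rough memory estimate for the boolean mask alone (1 byte per numpy bool)
n, m = 100_000, 10_000
print(f"{n * m / 1e9:.1f} GB")  # ~1.0 GB just for `mask`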

For large matrices, better approaches would involve sorting, or using KD-Trees. See this other answer.

Pierre D
  • I would also mention the memory requirements – norok2 Aug 25 '21 at 13:15
  • Your answer works really well. Thank you so much, especially for explaining how it works. It is so much clearer now. – Giuditta Davini Aug 26 '21 at 10:18
  • I got an issue: I tried with the full df, which has some latitude and longitude values that are None, and I got this error: TypeError: '<=' not supported between instances of 'float' and 'NoneType'. Any suggestion? I tried with: DF1.loc[:,'latitude'].fillna(np.nan) – Giuditta Davini Aug 27 '21 at 07:57
  • @GiudittaDavini: what do you expect for these rows? You should probably drop them first altogether (`df.dropna(subset=['latitude', 'longitude'])`). – Pierre D Aug 27 '21 at 12:05
  • @PierreD no, I need to keep those; I just need the label to be None. But in Databricks, when I try to apply the mask, I get the error. On Spyder with Python 3.8 there were no problems. – Giuditta Davini Aug 27 '21 at 13:18
  • Then replace NaN with a value that's outside of all bounding boxes. Note that `NaN` values work just fine (no exceptions) for me (`numpy=1.20.3`). Try with `np.inf`; if that fails, try with an outlandish value, e.g. `1e9`. – Pierre D Aug 27 '21 at 13:32
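
A minimal sketch of the workaround discussed in the last comments, assuming the rows with missing coordinates should simply end up with LABEL=None (casting to float turns None into NaN, which compares as False against everything):

import numpy as np

# sketch of the NaN workaround from the comments above (assumption: rows with
# missing coordinates should just get no label)
coords = df1[['latitude', 'longitude']].astype(float)  # None -> NaN
a = coords.fillna(np.inf).values  # or keep the NaN: it also fails every comparison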

This problem is not particularly well suited to be solved within Pandas itself as there are no simple primitives to handle the computation you need to do. A much better approach is to move to the NumPy or Numba domain to solve the problem at a lower level.

I will provide functions that generate the last column, under the assumption that it is relatively easy to copy that back into a DataFrame.

The original approach would read:

def locate_in_regions_OP(points, regions):
    result = np.full(len(points), None, dtype=object)
    for i in range(len(regions) - 1):
        result[
            (points['lat'] >= regions.loc[i:i, 'min_lat'].at[i])
            & (points['lat'] < regions.loc[i:i,'max_lat'].at[i])
            & (points['lon'] >= regions.loc[i:i, 'min_lon'].at[i])
            & (points['lon'] < regions.loc[i:i, 'max_lon'].at[i])
        ] = regions.loc[i:i, 'lbl'].at[i]
    return result

which produces the correct result for the last column. (The other approaches proposed in the OP are either immaterial variations, just using standalone assignments for quantities used only once, or ones I did not manage to get working.)

A relatively simple approach involves broadcasting; it is presented in @PierreD's answer and can be further simplified to read:

import numpy as np


def locate_in_regions_bc(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    mask = (
        np.all(min_arr[None, :, :] <= pos_arr[:, None, :], axis=2)
        & np.all(pos_arr[:, None, :] < max_arr[None, :, :], axis=2))
    has_any = np.any(mask, axis=1)
    first = np.argmax(mask, axis=1)
    result = np.full(len(points), None, dtype=object)
    result[has_any] = regions.loc[first[has_any], 'lbl'].values
    return result

This can be slightly simplified under the assumption that each position belongs to only one region:

import numpy as np


def locate_in_regions_bcu(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    mask = (
        np.all(min_arr[None, :, :] <= pos_arr[:, None, :], axis=2)
        & np.all(pos_arr[:, None, :] < max_arr[None, :, :], axis=2))
    pos, lbl = np.where(mask)    
    result = np.full(len(points), None, dtype=object)
    result[pos] = regions.loc[lbl, 'lbl'].values
    return result

However, there is a significant amount of unnecessary memory allocation and comparisons going on. A substantially faster approach involves explicit looping with Numba, where you can add short-circuiting explicitly. The code may read as follows:

import numpy as np
import numba as nb


def locate_in_regions_nb(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    found_arr = _locate_in_regions_nb(pos_arr, min_arr, max_arr)
    mask = found_arr >= 0
    result = np.full(len(points), None, dtype=object)
    result[mask] = regions.loc[found_arr[mask], 'lbl'].values
    return result
    

@nb.jit
def _locate_in_regions_nb(pos_arr, min_arr, max_arr):
    n, l = pos_arr.shape
    m = min_arr.shape[0]
    found_arr = np.full((n,), -1)
    for i in range(n):
        for j in range(m):
            contained = True
            for k in range(l):
                if min_arr[j, k] > pos_arr[i, k] or pos_arr[i, k] >= max_arr[j, k]:
                    contained = False
                    break
            if contained:
                found_arr[i] = j
                break
    return found_arr

With a slightly cleaner but otherwise comparable input:

import numpy as np
import pandas as pd
import numba as nb


df1 = pd.DataFrame(
    columns=['lat', 'lon', 'lbl'],
    data=[
        [1.3, 2.7, None],
        [3.5, 3.6, None],
        [2.8, 3.0, None],
        [9.7, 1.9, None],
        [1.7, 3.4, None],
        [3.5, 1.4, None],
        [2.7, 6.6, None],
        [1.7, 2.7, None],
        [1.3, 1.3, None],
    ])

df2 = pd.DataFrame(
    columns=['min_lat', 'max_lat', 'min_lon', 'max_lon', 'lbl'],
    data=[
        [1.0, 2.0, 1.0, 3.0, 'AAA'],
        [3.0, 4.0, 1.0, 2.0, 'BBB'],
        [3.0, 4.0, 3.0, 4.0, 'CCC'],
        [5.0, 7.0, 2.0, 3.0, 'DDD'],
    ])


print(df1)
#    lat  lon   lbl
# 0  1.3  2.7  None
# 1  3.5  3.6  None
# 2  2.8  3.0  None
# 3  9.7  1.9  None
# 4  1.7  3.4  None
# 5  3.5  1.4  None
# 6  2.7  6.6  None
# 7  1.7  2.7  None
# 8  1.3  1.3  None

print(df2)
#    min_lat  max_lat  min_lon  max_lon  lbl
# 0      1.0      2.0      1.0      3.0  AAA
# 1      3.0      4.0      1.0      2.0  BBB
# 2      3.0      4.0      3.0      4.0  CCC
# 3      5.0      7.0      2.0      3.0  DDD

One gets the expected output in all cases:

print(locate_in_regions_OP(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_bc(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_bcu(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_nb(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']

While it is not easy to produce arbitrarily large inputs which are representative of the problem, a naïve timing indicates the Numba approach to be substantially faster:

np.random.seed(0)
df3 = pd.DataFrame(
    columns=['lat', 'lon', 'lbl'],
    data=np.random.random((1000000, 3)) * 5)
df3.loc[:, 'lbl'] = None


%timeit locate_in_regions_OP(df3, df2)
# 209 ms ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit locate_in_regions_bc(df3, df2)
# 161 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit locate_in_regions_bcu(df3, df2)
# 115 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit locate_in_regions_nb(df3, df2)
# 66.6 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
norok2