
I have data frame "A" that looks like this:

type    latw    lngs    late    lngn
0   1000    45.457966   9.174864    45.458030   9.174907
1   1000    45.457966   9.174864    45.458030   9.174907
2   1000    45.458030   9.174864    45.458094   9.174907
3   1000    45.458094   9.174864    45.458157   9.174907
4   1000    45.458157   9.174864    45.458221   9.174907
5   1000    45.458221   9.174864    45.458285   9.174907
6   1000    45.458285   9.174864    45.458349   9.174907
7   1000    45.458349   9.174864    45.458413   9.174907
8   1000    45.458413   9.174864    45.458477   9.174907
9   1000    45.458477   9.174864    45.458540   9.174907
10  1000    45.458540   9.174864    45.458604   9.174907
11  1000    45.458604   9.174864    45.458668   9.174907
12  1000    45.458668   9.174864    45.458732   9.174907
13  1000    45.458732   9.174864    45.458796   9.174907
14  1000    45.458796   9.174864    45.458860   9.174907
15  1000    45.458860   9.174864    45.458923   9.174907
16  1000    45.458923   9.174864    45.458987   9.174907
17  1000    45.458987   9.174864    45.459051   9.174907
18  1000    45.459051   9.174864    45.459115   9.174907
19  1000    45.459115   9.174864    45.459179   9.174907
20  1000    45.459179   9.174864    45.459243   9.174907
21  1000    45.459243   9.174864    45.459306   9.174907
22  1000    45.459306   9.174864    45.459370   9.174907
23  1000    45.459370   9.174864    45.459434   9.174907
24  1000    45.459434   9.174864    45.459498   9.174907
25  1000    45.459498   9.174864    45.459562   9.174907
26  1000    45.459562   9.174864    45.459626   9.174907
27  1000    45.459626   9.174864    45.459689   9.174907
28  1000    45.459689   9.174864    45.459753   9.174907
29  1000    45.459753   9.174864    45.459817   9.174907
... ... ... ... ... ...
970 1000    45.460583   9.175545    45.460647   9.175587
971 1000    45.460647   9.175545    45.460711   9.175587
972 1000    45.460711   9.175545    45.460775   9.175587
973 1000    45.460775   9.175545    45.460838   9.175587
974 1000    45.460838   9.175545    45.460902   9.175587
975 1000    45.460902   9.175545    45.460966   9.175587
976 1000    45.460966   9.175545    45.461030   9.175587
977 1000    45.461030   9.175545    45.461094   9.175587
978 1000    45.461094   9.175545    45.461157   9.175587
979 1000    45.461157   9.175545    45.461221   9.175587
980 1000    45.461221   9.175545    45.461285   9.175587
981 1000    45.461285   9.175545    45.461349   9.175587
982 1000    45.461349   9.175545    45.461413   9.175587
983 1000    45.461413   9.175545    45.461477   9.175587
984 1000    45.461477   9.175545    45.461540   9.175587
985 1000    45.461540   9.175545    45.461604   9.175587
986 1000    45.461604   9.175545    45.461668   9.175587
987 1000    45.457966   9.175587    45.458030   9.175630
988 1000    45.458030   9.175587    45.458094   9.175630
989 1000    45.458094   9.175587    45.458157   9.175630
990 1000    45.458157   9.175587    45.458221   9.175630
991 1000    45.458221   9.175587    45.458285   9.175630
992 1000    45.458285   9.175587    45.458349   9.175630
993 1000    45.458349   9.175587    45.458413   9.175630
994 1000    45.458413   9.175587    45.458477   9.175630
995 1000    45.458477   9.175587    45.458540   9.175630
996 1000    45.458540   9.175587    45.458604   9.175630
997 1000    45.458604   9.175587    45.458668   9.175630
998 1000    45.458668   9.175587    45.458732   9.175630
999 1000    45.458732   9.175587    45.458796   9.175630

It has 22,000,000 rows × 5 columns. There is also a data frame "B", which looks like this:

    type        Lat       Lng
0      0  45.465739  9.180830
1      2  45.463950  9.187113
2      1  45.468015  9.180648
3      1  45.462209  9.187447
4      0  45.459578  9.184007
5      1  45.459822  9.187034
6      2  45.454988  9.180310
7      2  45.459818  9.189377
8      0  45.462200  9.187440
9      0  45.467160  9.180100
10     2  45.459407  9.183300
11     2  45.457699  9.187434
12     1  45.455319  9.186697
13     0  45.461138  9.191943
14     2  45.456397  9.189028
15     0  45.457062  9.185878
16     1  45.461980  9.187024
17     1  45.464319  9.183142
18     2  45.464227  9.187065
19     1  45.460886  9.185216

It has 2,000,000 rows × 3 columns. I want to replace the type value in data frame "A" with the one from "B" wherever:

A[latw] < B[Lat] < A[late] and A[lngs] < B[Lng] < A[lngn]

In other words, I want to check which one of the rectangles in A each location from B belongs to.
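
For a single point from B, the containment test in pandas would look like this (a sketch, assuming A is the frame above and lat, lng hold the coordinates of one row of B):

mask = (A.latw < lat) & (lat < A.late) & (A.lngs < lng) & (lng < A.lngn)
A.loc[mask, 'type']  # the rectangle(s) containing the point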

PS: I'm looking for the fastest way to do this in Python, e.g. using parallel processing.

– asikhalaban
  • Hi! so what have you tried, where did you fail? Why do you use the [tag:parallel-processing] tag, but don't refer to it in your question? This is **not** a free code-writing service. – Marcus Müller Jan 18 '17 at 19:00
  • @MarcusMüller I don't know how to write it. I mentioned on the last line that I'm looking for the fastest way in Python: I know that simple code (a plain loop) would take too much time, so I'm looking for parallel-processing code. – asikhalaban Jan 18 '17 at 22:18
  • @asikhalaban, did you consider using GeoPandas? – MaxU - stand with Ukraine Jan 18 '17 at 23:18
  • @asikhalaban, do you have equally sized rectangles in `A` DF? – MaxU - stand with Ukraine Jan 18 '17 at 23:48
  • @MaxU I didn't use GeoPandas. Yeah, I have equally sized rectangles, and I want to check which one of the rectangles in A a location from B belongs to. I wasn't able to find a solution in pandas because it can only match exact values. – asikhalaban Jan 19 '17 at 01:16
  • Is the algorithm supposed to check whether there are multiple points (with different "types") from B in a rectangle from A? – hvwaldow Jan 24 '17 at 10:40
  • Do any of the values in your examples actually fit together? – Khris Jan 24 '17 at 11:02
  • For parallel processing use the `multiprocessing` module and create a `pool` with a worker function. Then cut data frame A into as many pieces as you need and send the pieces together with B to the pool; a rough sketch of this idea follows. – Khris Jan 24 '17 at 11:09
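
A minimal sketch of the chunked-pool idea from the comment above. The worker logic and chunk count are illustrative assumptions, not tested against the real 22M-row data, and A and B are assumed to be the frames from the question:

from multiprocessing import Pool

def worker(args):
    # for one chunk of A, collect (A index, B index) pairs where the
    # point from B falls strictly inside the rectangle from A
    a_chunk, b = args
    pairs = []
    for ai, row in a_chunk.iterrows():
        hits = b[(row.latw < b.Lat) & (b.Lat < row.late) &
                 (row.lngs < b.Lng) & (b.Lng < row.lngn)]
        pairs.extend((ai, bi) for bi in hits.index)
    return pairs

if __name__ == '__main__':
    # split A into 8 roughly equal pieces; each goes to its own process
    step = -(-len(A) // 8)
    tasks = [(A.iloc[i:i + step], B) for i in range(0, len(A), step)]
    with Pool(8) as pool:
        matches = [pair for part in pool.map(worker, tasks) for pair in part]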

3 Answers


You can do something like this:

Assuming we have the following DFs:

In [150]: A
Out[150]:
     type       latw       late      lngs      lngn
0    1000  45.457966  45.458030  9.174864  9.174907
1    1001  45.457966  45.458030  9.174864  9.174907
2    1002  45.458030  45.458094  9.174864  9.174907
3    1003  45.458094  45.458157  9.174864  9.174907
16   1004  45.458923  45.458987  9.174864  9.174907
17   1005  45.458987  45.459051  9.174864  9.174907
970  1006  45.460583  45.460647  9.175545  9.175587
971  1007  45.460647  45.460711  9.175545  9.175587
972  1008  45.460711  45.460775  9.175545  9.175587
996  1009  45.458540  45.458604  9.175587  9.175630
997  1010  45.458604  45.458668  9.175587  9.175630
998  1011  45.458668  45.458732  9.175587  9.175630
999  1012  45.458732  45.458796  9.175587  9.175630

In [151]: B
Out[151]:
   type      Lat     Lng
0     0  45.4581  9.1749
1     1  45.4590  9.1749
2     2  45.4586  9.1756

Solution:

# for each point in B, take the `type` of the first rectangle in A that
# contains it; apply() stacks the single-element results into a frame
# whose diagonal holds each point's match, hence .values.diagonal()
B['new'] = B.apply(lambda x: A.loc[  (A.latw < x.Lat) & (x.Lat < A.late)
                                   & (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type'].head(1),
                   axis=1) \
            .values.diagonal()

In [153]: B
Out[153]:
   type      Lat     Lng     new
0     0  45.4581  9.1749  1003.0
1     1  45.4590  9.1749  1005.0
2     2  45.4586  9.1756  1009.0

PS I'm not sure this is the fastest way to achieve that...

– MaxU - stand with Ukraine
  • This is one area where pandas is severely lacking in my opinion - conditional joining. `merge_asof` would work here if it was built for more than one joining column but it is not. Also, your solution alters table `B`. Isn't the question asking how to replace A with B? – Ted Petrou Jan 23 '17 at 17:46
  • @TedPetrou, thank you for your comment! I've tried to use `merge_asof` already (with integers made from float values) and have also recognized that it doesn't accept multiple joining columns - it would be really great. OP has written the following comment: `I want to check a location from B belongs to which one of the rectangles in A`, so i decided they want to alter table B... – MaxU - stand with Ukraine Jan 23 '17 at 17:53
  • @MaxU Thank you for your solution and also for editing the title. Sorry, I was working on another part. I used your answer, but I ran into this error: "TypeError: ('len() of unsized object', u'occurred at index 0')". I made two test DFs A and B and it worked well, but when I try it on my real DFs it shows the above error. Do you want me to open a new question to share my code? – asikhalaban Jan 25 '17 at 20:00
  • @asikhalaban, i would strongly recommend you to use [@ piRSquared's solution](http://stackoverflow.com/a/41847477/5741205) - it's much better and much faster – MaxU - stand with Ukraine Jan 25 '17 at 20:03
  • @asikhalaban, in order to help you with that error i would need a reproducible data sets... – MaxU - stand with Ukraine Jan 25 '17 at 22:51
  • @asikhalaban, sorry, i don't share my email. If you want to ask questions - stackoverflow is a perfect platform for that. You will have a huge community instead of one person (who might not have enough time to answer your questions) – MaxU - stand with Ukraine Jan 25 '17 at 22:53
  • @MaxU, ok. sorry I asked for it. – asikhalaban Jan 26 '17 at 00:18

High Level Explanation

This answer is quite involved. The following are the high-level points:

  • Use np.searchsorted with the west latitudes in A relative to the sorted latitudes in B. Repeat this for the east latitudes, but with side='right'. This tells me where each west/east latitude falls in the spectrum defined by B. The results of np.searchsorted are index values, so if an east-based result is greater than the corresponding west-based result, a latitude from B lies between west and east (a toy example follows this list).
    • Track where east searches are greater than west searches.
    • Find the difference. Not only do I care if the east search is greater than the west search, the difference tells me how many B.Lat's are in between. I expand arrays to track all possible matches.
  • Repeat this process for south and north longitudes but only over the subset in which we found matching latitudes. No need to waste effort looking where we know we don't have a match.
    • Perform tricky slicing to unwind positions
  • Use numpy structured arrays to find intersections of 2-D arrays
  • This final intersection has ordinal positions from A and matching ordinal positions from B.
    • Remember, I said we could possibly match more than one row from B; I take the first.
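
A toy example of the two-sided searchsorted step, with made-up coordinates (not the real data):

import numpy as np

b_lat = np.array([45.4581, 45.4586, 45.4590])   # sorted latitudes from B
latw = np.array([45.4580, 45.4589])             # west edges of two rectangles in A
late = np.array([45.4585, 45.4595])             # east edges of the same rectangles

left = b_lat.searchsorted(latw)                 # -> array([0, 2])
right = b_lat.searchsorted(late, side='right')  # -> array([1, 3])
# right > left means some latitude from B lies between west and east;
# right - left tells how many (here one per rectangle)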

Demonstration

Using the A and B provided by @MaxU:

p = process(A, B)
print(p)

[[3 0]
 [5 1]
 [9 2]]

To show the results and compare, I'll put them side by side:

pd.concat([
        A.iloc[p[:, 0]].reset_index(drop=True),
        B.iloc[p[:, 1]].reset_index(drop=True)
    ], axis=1, keys=['A', 'B'])

[image: matched rows of A and B shown side by side]

However, to swap the type values, I put a flag in the process function to reassign. Just run `process(A, B, reassign_type=True)`.

Timing

I defined the functions to return the same thing, to make this an apples-to-apples comparison:

def pir(A, B):
    return A.values[process(A, B)[:, 0], 0]

def maxu(A, B):
    return B.apply(
        lambda x: A.loc[
            (A.latw < x.Lat) &
            (x.Lat < A.late) &
            (A.lngs < x.Lng) &
            (x.Lng < A.lngn), 'type'
        ].head(1),
        axis=1
    ).values.diagonal()

[image: timing results for pir vs maxu]

more thorough testing

import numpy as np
import pandas as pd
from timeit import timeit
from itertools import product

rng = np.arange(20, 51, 5)
idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A'])
functions = pd.Index(['pir', 'maxu'], name='method')

results = pd.DataFrame(index=idx, columns=functions)

for bi, ai in idx:
    n = ai
    ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]])
    A = pd.DataFrame(
        np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]),
        columns=['type', 'latw', 'lngs', 'late', 'lngn']
    )
    m = int(bi ** .5)
    prng = np.arange(m) / m * 10 + .5
    B = pd.DataFrame(
        np.array(list(product(prng, prng))) + np.array([[40, 30]]),
        columns=['Lat', "Lng"])
    B.insert(0, 'type', np.random.randint(3, size=len(B)))
    for j in functions:
        # set_value was removed in later versions of pandas;
        # results.loc[(bi, ai), j] = ... is the modern equivalent
        results.set_value(
            (bi, ai), j,
            timeit(
                '{}(A, B)'.format(j),
                'from __main__ import {}, A, B'.format(j),
                number=10
            )
        )

results.plot(rot=45)

[image: runtimes of pir and maxu as the sizes of A and B grow]


Code

from collections import namedtuple
TwoSided = namedtuple('TwoSided', ['match', 'idx', 'found'])

def two_sided_search(left, right, a):
    # sort `a` and save order for later
    argsort = a.argsort()
    asorted = a[argsort]

    # where does `left` fall in `asorted`
    s_left = asorted.searchsorted(left)
    # where does `right` fall in `asorted`
    s_right = asorted.searchsorted(right, side='right')

    rng = np.arange(left.size)

    # a match happens when where we found `left`, `right` was found later
    match_idx = rng[s_left < s_right]

    # ignore positions with no match
    left_pos = s_left[match_idx]
    right_pos = s_right[match_idx]
    # distance from where left was found to where right was found
    # this represents how many items in a are wrapped by left and right
    diff_pos = right_pos - left_pos
    # build an index on `match_idx` paired with all positions in
    # `asorted`.
    d2 = left_pos - np.append(0, diff_pos[:-1].cumsum())
    p1 = np.arange(len(left_pos)).repeat(diff_pos)
    p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos)

    # returns
    # `match_idx` which is an integer slice of either `left` or `right`
    # for each element in `match_idx` there are one or more elements in
    # `a` that were surrounded by `left` and `right`
    # `p1` are the positions in `match_idx` that are paired with `argsort[p2]`
    # for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]`
    # this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]`
    # and `left[match_idx[1]] < a[5] < right[match_idx[1]]`
    # and `left[match_idx[2]] < a[12] < right[match_idx[2]]`
    # and `left[match_idx[2]] < a[5] < right[match_idx[2]]`

    return TwoSided(match_idx, p1, argsort[p2])

def intersectNd(a, b):
    # taken from https://stackoverflow.com/a/8317403/2336654
    # returns the interesection of 2 2-D arrays
    # the strategy is to convert to a structured array then
    # use np.intersect1d then convert back to unstructured
    ncols = a.shape[1]
    dtype = dict(
        names=['f{}'.format(i) for i in range(ncols)],
        formats=ncols * [a.dtype]
    )

    return np.intersect1d(
        a.view(dtype), b.view(dtype)
    ).view(a.dtype).reshape(-1, ncols)
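
# quick sanity check of intersectNd with made-up rows:
# intersectNd(np.array([[1, 2], [3, 4]]), np.array([[3, 4], [5, 6]]))
# -> array([[3, 4]])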

def where_in(b1, b2):
    # return a mask of length len(b2) where True
    # when element of b2 is in b1
    bsort = b1.argsort()
    bsrtd = b1[bsort]
    bsrch = bsrtd.searchsorted(b2) % bsrtd.size
    return bsrtd[bsrch] == b2
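
# quick sanity check of where_in with made-up values:
# where_in(np.array([3, 1, 2]), np.array([2, 5]))
# -> array([ True, False])  (2 occurs in b1, 5 does not)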

def process(A, B, reassign_type=False):
    lats = two_sided_search(
        A.latw.values,
        A.late.values,
        B.Lat.values,
    )

    # subset A & B
    # no point looking for longitude matches
    # in rows without a latitude match
    lngs = A.lngs.values[lats.match]
    lngn = A.lngn.values[lats.match]
    u, i = np.unique(lats.found, return_index=1)
    lng = B.Lng.values[u]

    lons = two_sided_search(
        lngs, lngn, lng
    )

    # `lats.match` is where we found successful latitudes
    # `lons.match` is where we found successful longitudes
    # since `lons.match` was based on `lats.match` we can slice
    # to get the original positions of `A` that satisfy both
    # the latw < blat < late & lngs < blng < lngn
    match = lats.match[lons.match]

    # however, for every row in A that might be a match for B,
    # there may be multilple B's
    #
    # We start addressing this by finding
    # where in lats.idx do we have matches
    # this gives me a boolean array of lats.match
    # index values repeated for each row in B
    # that satisfies our conditions
    wi = where_in(lons.match, lats.idx)

    # a (? x 2) array representing all combinations of
    # rows in A that matched the longitude test and the B rows they contain
    lon_pos = np.hstack([
            lats.match[lons.match[lons.idx], None],
            u[lons.found, None]
        ])

    lat_pos = np.hstack([
            lats.match[lats.idx[wi], None],
            lats.found[wi, None]
        ])

    x = intersectNd(lat_pos, lon_pos)
    p, pi = np.unique(x[:, 0], return_index=1)

    if reassign_type:
        A.iloc[x[pi, 0], A.columns.get_loc('type')] = \
            B.iloc[x[pi, 1], B.columns.get_loc('type')].values
    else:
        return x[pi]
– piRSquared
  • This is awesome, thank you! I will still need to invest more time to understand it completely, though... :-) – MaxU - stand with Ukraine Jan 25 '17 at 21:23
  • @piRSquared Thank you, your answer looks great, but I didn't fully get it because I'm not as proficient in Python as you. Could you explain each of the functions? I don't understand how the process function works, and I also didn't get how your answer addresses my question. – asikhalaban Jan 25 '17 at 22:13
  • @asikhalaban each function is commented. I'll try to improve the explanation later – piRSquared Jan 25 '17 at 22:28
  • @piRSquared Yeah, the comments are great, but I don't know the functions you used, so I need more explanation. Also, you mentioned np.searchsorted, but I didn't get how I have to use it for my data. – asikhalaban Jan 25 '17 at 22:32

I ran the solution suggested by MaxU on a data size of 1,000, and timeit showed 1.2 s as the best of 3. I achieved some improvement by creating a list `c` which holds the boolean results of the comparison, then using this list to assign, as follows:

c = list((A.latw < B.Lat) & (A.late > B.Lat) & (A.lngs < B.Lng) & (A.lngn > B.Lng))

# note: the aligned comparison above assumes A and B share the same index
B.apply(lambda x: A.loc[c, 'type'].head(1), axis=1).values.diagonal()

The first part runs on the order of microseconds and the second part takes milliseconds.

Can you please try it on your large data set and see if it really improves the performance?

– Mahesh