I have a dataframe that looks like this (it represents zones in a 2D space; note that they overlap, and that is OK):

>>> import pandas as pd
>>> from random import uniform
>>> zones = pd.DataFrame(dict(
    minx=[-10, -10, -5],
    maxx=[10, 10, 5],
    miny=[-10, 0, 0],
    maxy=[10, 10, 10],
), index=range(1,4))
>>> zones.index.name = "zone"
>>> zones
      minx  maxx  miny  maxy
zone
1      -10    10   -10    10
2      -10    10     0    10
3       -5     5     0    10

I have a second dataframe of ordered pairs that looks something like the following (random numbers here since they don't really matter):

>>> pairs = pd.DataFrame(list(zip((uniform(0, 10) for _ in range(10)), (uniform(0,10) for _ in range(10)))), index=range(1,11), columns=["cx", "cy"])
>>> pairs.index.name = "pair"
>>> pairs["zone"] = "??"
>>> pairs
               cx        cy zone
pair
1        8.405715  2.691102   ??
2        6.645482  1.843225   ??
3        4.123719  8.996641   ??
4        7.003991  9.695182   ??
5        7.296730  1.175356   ??
6        7.960617  9.503888   ??
7        7.694749  6.907869   ??
8        8.308742  5.439141   ??
9        6.404875  5.663983   ??
10       3.361129  3.123590   ??

I want to fill the "zone" series of this dataframe with the correct zone number for each cx, cy pair based on the zone definitions in the first dataframe.

The code I have written to do this is below. However, I am sure there is a much better way to do it using pandas (i.e., without iterating over the zones columns). How should it be done?

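# Zones are visited in ascending order, so a higher zone number overwrites a lower one.
# (iterrows() replaces transpose().iteritems(), which was removed in pandas 2.0.)
for num, zone in zones.iterrows():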
    idx = (
        (pairs.cx.gt(zone["minx"]))
        & (pairs.cx.lt(zone["maxx"]))
        & (pairs.cy.gt(zone["miny"]))
        & (pairs.cy.lt(zone["maxy"]))
    )
    pairs.loc[idx, "zone"] = num

NOTE: The highest zone number wins. For example, pair 3 in the second table above has the approximate ordered pair (4.1, 9.0) and falls inside zones 1, 2, AND 3, so it should be assigned zone 3. Pair 9, with approximate ordered pair (6.4, 5.7), falls outside zone 3 (its cx is greater than 5) but inside zones 1 AND 2, so it should be assigned zone 2.
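
To make the "highest zone number wins" rule concrete, here is a minimal sketch that runs the loop on just two points, the rounded coordinates of pairs 3 and 9 from the table above. The placeholder value 0 for "no zone" and the name `inside` are assumptions made for this illustration; they are not part of the original code.

```python
import pandas as pd

zones = pd.DataFrame(
    dict(minx=[-10, -10, -5], maxx=[10, 10, 5], miny=[-10, 0, 0], maxy=[10, 10, 10]),
    index=pd.Index(range(1, 4), name="zone"),
)

# Rounded coordinates of pairs 3 and 9 from the table above.
pairs = pd.DataFrame(
    dict(cx=[4.1, 6.4], cy=[9.0, 5.7]),
    index=pd.Index([3, 9], name="pair"),
)
pairs["zone"] = 0  # assumed placeholder meaning "not inside any zone"

# Zones are visited in ascending order, so a later (higher) zone
# overwrites the assignment made by an earlier (lower) one.
for num, zone in zones.iterrows():
    inside = (
        pairs.cx.gt(zone["minx"])
        & pairs.cx.lt(zone["maxx"])
        & pairs.cy.gt(zone["miny"])
        & pairs.cy.lt(zone["maxy"])
    )
    pairs.loc[inside, "zone"] = num

print(pairs)
```

Pair 3 ends up in zone 3 and pair 9 in zone 2, because zone 3 only spans -5 < cx < 5.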

Rick
  • In case it is confusing: I unintentionally set my random number range to be only positive numbers, which means no generated points will be zone 1. I am going to leave this as-is so I don't have to copy and paste a new set of randoms etc. – Rick Dec 18 '19 at 22:30
  • For an O(nm) solution, using a loop is OK in this case – BENY Dec 18 '19 at 22:40
  • @YOandBEN_W I'm self taught and unfamiliar with Big-O. What does nm denote? – Rick Dec 18 '19 at 22:52
  • https://stackoverflow.com/questions/23896399/difference-between-omn-and-omn :-) – BENY Dec 18 '19 at 22:53

1 Answer

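A small improvement is to use NumPy broadcasting: compare every pair against every zone at once, multiply each row of the resulting boolean mask by its zone number, and take the maximum per pair.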

cx = pairs.cx.values
cy = pairs.cy.values
minx, maxx, miny, maxy = zones.T.values
s = (
    pd.DataFrame(
        (cx > minx[:, None])
        & (cx < maxx[:, None])
        & (cy > miny[:, None])
        & (cy < maxy[:, None])
    )
    .mul(zones.index, axis=0)
    .max()
)

s
0    2
1    2
2    2
3    3
4    2
5    3
6    2
7    3
8    3
9    2
dtype: int64

pairs["zone"] = s.values
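
For readers who want to check the broadcasting idea on known inputs, here is a self-contained sketch. It is a variation on the code above, not the exact same code: it uses `np.where` instead of building an intermediate DataFrame, and the four sample points, the name `inside`, and the use of 0 for "no zone" are assumptions chosen for illustration. With m zones and n pairs, the mask has m × n entries, which is the O(nm) cost mentioned in the comments.

```python
import numpy as np
import pandas as pd

zones = pd.DataFrame(
    dict(minx=[-10, -10, -5], maxx=[10, 10, 5], miny=[-10, 0, 0], maxy=[10, 10, 10]),
    index=pd.Index(range(1, 4), name="zone"),
)

# Hypothetical sample points: one per expected outcome (zone 3, 2, 1, none).
pairs = pd.DataFrame(
    dict(cx=[4.1, 6.4, -7.0, 20.0], cy=[9.0, 5.7, -3.0, 20.0]),
    index=pd.Index(range(1, 5), name="pair"),
)

cx = pairs["cx"].to_numpy()  # shape (n,)
cy = pairs["cy"].to_numpy()  # shape (n,)

# Reshape each zone bound to a column vector of shape (m, 1) so that
# comparing it with an (n,) array broadcasts to an (m, n) result.
minx, maxx, miny, maxy = (
    zones[col].to_numpy()[:, None] for col in ["minx", "maxx", "miny", "maxy"]
)

# inside[i, j] is True when pair j lies strictly inside zone i.
inside = (cx > minx) & (cx < maxx) & (cy > miny) & (cy < maxy)

# Replace True with the zone number and False with 0, then take the
# maximum down each column so the highest matching zone wins.
zone_numbers = zones.index.to_numpy()[:, None]
pairs["zone"] = np.where(inside, zone_numbers, 0).max(axis=0)

print(pairs)
```

The maximum trick relies on zone numbers being positive integers; a point that matches no zone is left at 0.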
BENY
  • this is interesting; i haven't used numpy arrays much directly (except to do mathematics). – Rick Dec 19 '19 at 13:59