I have two data frames.

df1 has the columns Order_ID, Lat and Long:

Order_ID  Lat      Long
1         32.0455  -76.9876
2         32.5679  -77.3421
3         33.4567  -77.9876

df2 has the columns Category, Lat and Long:

Category  Lat      Long
S1        32.0109  -76.0765
S1        32.8769  -77.5674
S1        33.1987  -78.7654
S2        33.5967  -78.0765
S2        33.8769  -79.5674
S2        34.1987  -79.7654

df1 is order-level data, with a latitude and longitude for each order.

df2 has multiple lat/long points for each category; together they define a separate boundary on the map for each category.

I want to map each order ID to a category. For example, based on the polygons of S1 and S2, each order would lie inside one of the categories.

How can I map Order_ID in df1 to Category in df2? Please help with some dummy Python pandas code.

  • Please share what `df1` and `df2` look like and what you've tried so far. Potential duplicate: https://stackoverflow.com/questions/48097742/geopandas-point-in-polygon – mitoRibo Aug 31 '22 at 00:07
  • Please provide enough code so others can better understand or reproduce the problem. – Community Aug 31 '22 at 01:20

1 Answer

  • I tried your sample data, but it is too sparse: no order falls inside the convex hull of either category's points
  • so I have simulated some data to demonstrate the approach
    1. create a GeoDataFrame of the orders
    2. create a GeoDataFrame holding the convex hull of the points that make up each category
    3. sjoin() the two GeoDataFrames to find the association you need
  • I have also included a visualisation to show how this works
import geopandas as gpd
import pandas as pd
import numpy as np

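# note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0,
# so this line needs an older GeoPandas release
gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_cities"))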
gdf = gdf.loc[gdf["name"].isin(["London", "Paris", "Brussels"])]
# gdf = gdf.sample(10)

# pandas dataframes structured as per question
df1 = pd.DataFrame(
    {"Long": gdf["geometry"].x, "Lat": gdf["geometry"].y, "Order_ID": gdf["name"]}
)
N = 8
df2 = pd.concat(
    [
        pd.DataFrame(
            {
                "Long": np.random.uniform(r.minx, r.maxx, N),
                "Lat": np.random.uniform(r.miny, r.maxy, N),
                "Category": np.full(N, chr(65 + i)),  # A, B, C, ...
            }
        )
        for i, r in gdf.reset_index()
        .to_crs(gdf.estimate_utm_crs())
        .buffer(3 * 10**5)
        .to_crs(gdf.crs)
        .bounds.iterrows()
    ]
)

# sample data from the question; too sparse for any order to fall inside a hull
# df1 = pd.DataFrame(
#     **{
#         "index": [0, 1, 2],
#         "columns": ["Order_ID", "Lat", "Long"],
#         "data": [[1, 32.0455, -76.9876], [2, 32.5679, -77.3421], [3, 33.4567, -77.9876]],
#     }
# )

# df2 = pd.DataFrame(
#     **{
#         "index": [0, 1, 2, 3, 4, 5],
#         "columns": ["Category", "Lat", "Long"],
#         "data": [
#             ["S1", 32.0109, -76.0765],
#             ["S1", 32.8769, -77.5674],
#             ["S1", 33.1987, -78.7654],
#             ["S2", 33.5967, -78.0765],
#             ["S2", 33.8769, -79.5674],
#             ["S2", 34.1987, -79.7654],
#         ],
#     }
# )

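# orders as points; EPSG:4326 is WGS 84 longitude/latitude
gdf1 = gpd.GeoDataFrame(
    df1["Order_ID"],
    geometry=gpd.points_from_xy(x=df1["Long"], y=df1["Lat"]),
    crs="epsg:4326",
)

# want convex hull of all points that make up a category
gdf2 = (
    gpd.GeoDataFrame(
        df2["Category"],
        geometry=gpd.points_from_xy(x=df2["Long"], y=df2["Lat"]),
        crs="epsg:4326",
    )
    .dissolve("Category")
    # replace the geometry in place so the result stays a GeoDataFrame,
    # which is what sjoin() requires
    .assign(geometry=lambda d: d.convex_hull)
    .reset_index()
)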

# get association between order and category using geometry
gpd.sjoin(gdf1, gdf2)
     Order_ID  geometry                                     index_right  Category
158  Brussels  POINT (4.33137074969045 50.83526293533032)   0            A
187  London    POINT (-0.118667702475932 51.5019405883275)  1            B
199  Paris     POINT (2.33138946713035 48.86863878981461)   2            C
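
To see why the sample data from the question produces no matches, here is a minimal standard-library sketch of the same idea: build the convex hull of each category's points, then test each order against it. The helpers `cross`, `convex_hull` and `inside_convex` are illustrative names written for this example, not GeoPandas API, and the sketch treats longitude/latitude as plain planar coordinates, which is an assumption that is only reasonable for small areas like this one.

```python
# Sample data from the question as (Long, Lat) pairs, so x = longitude, y = latitude.
orders = {1: (-76.9876, 32.0455), 2: (-77.3421, 32.5679), 3: (-77.9876, 33.4567)}
boundaries = {
    "S1": [(-76.0765, 32.0109), (-77.5674, 32.8769), (-78.7654, 33.1987)],
    "S2": [(-78.0765, 33.5967), (-79.5674, 33.8769), (-79.7654, 34.1987)],
}


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a left (CCW) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


def inside_convex(hull, p):
    """True if p lies inside or on the edge of a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


hulls = {cat: convex_hull(pts) for cat, pts in boundaries.items()}
matches = {
    order_id: [cat for cat, hull in hulls.items() if inside_convex(hull, p)]
    for order_id, p in orders.items()
}
print(matches)  # prints {1: [], 2: [], 3: []}
```

Every order maps to an empty list, so with only three boundary points per category none of the sample orders lies inside a hull. For real data, `sjoin()` is the better tool because it handles arbitrary polygons and uses a spatial index.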

visualise

# visualise it...
m = gdf2.explore(height=300, width=500)
gdf1.explore(m=m, color="red")


Rob Raymond
  • I get this error when I adapt this to my use case: ValueError: 'right_df' should be GeoDataFrame, got . I am using only the last part of your code (sjoin) and loading the dfs as pandas DataFrames. – Shivam Bindal Aug 31 '22 at 16:29
  • Thanks for sharing this, but for gdf2 I still get the same error: ValueError: 'right_df' should be GeoDataFrame, got . Somehow gdf2 is being treated as a pandas DataFrame instead of a GeoDataFrame. – Shivam Bindal Aug 31 '22 at 16:50
  • The error is exactly what it says: `sjoin()` only works on GeoDataFrames, which is why I created `gdf1` and `gdf2` from `df1` and `df2` respectively. – Rob Raymond Aug 31 '22 at 17:17