How to match a partial string to a range of possible values?

Question

I have the following dataframes:

df1:

ZIP code	Other columns
1011AA	...
1011AA	...
2316XH	...
5815NE	...

df2:

starting value ZIP code range	last value ZIP code range	Province
1000	1200	North-Holland
1201	1500	South-Holland
1501	1570	North-Holland
1571	1600	Den Haag

I want to:

Get the first four digits of df1["ZIP code"]
Check if these four digits are present in any range in df2["starting value ZIP code range"] and df["last value ZIP code range"]
If there is a match, get df2["Province"] and add this value to a column in df1.

The difficulty is that I need to compare this to a range of values and I can only use the first 4 digits of the string. Most examples I found on stackoverflow compare based on a single value. The desired result is:

ZIP code	New column
1011AA	North-Holland
1011AA	North-Holland
2316XH	Haarlem
5815NE	Utrecht

Bonus points if you can do it using map. For example, df1["New column"] = df1["ZIP code"].str[:4].map(... ? ...). However, if the map method is a bad idea please suggest a better method.

`df1['Province'] = pd.merge_asof(df1.assign(key=df1['ZIP code'].str.extract('(\d+)', expand=False).astype(int)), df2, left_on='key', right_on='starting value ZIP code range').query('key <= \`last value ZIP code range\`')['Province']` — mozway, Feb 28 '23 at 10:14
The query does not look correct. Notice that the the province is not unique in df2. That is, the ranges are split. — Xtiaan, Feb 28 '23 at 10:21
It shouldn't matter. Let me provide a full answer and then you can test it and give me a counter example if needed — mozway, Feb 28 '23 at 10:22
Please check the answer below and report any incorrect behavior with a reproducible example — mozway, Feb 28 '23 at 10:28

mozway · Accepted Answer · 2023-02-28T10:30:50.560

1

As your ranges are non-overlapping, you can use a merge_asof on the starting boundary and filter its output (for example with query) to ensure it's within the ending boundary:

df1['Province'] = (
 pd.merge_asof(df1.assign(key=df1['ZIP code'].str.extract('(\d+)', expand=False).astype(int)), df2,
               left_on='key', right_on='starting value ZIP code range')
   .query('key <= `last value ZIP code range`')['Province']
)

Output:

  ZIP code Other columns       Province
0   1011AA           ...  North-Holland
1   1011AA           ...  North-Holland
2   2316XH           ...            NaN
3   5815NE           ...            NaN

other example

Let's add one more entry to df2:

# df2
   starting value ZIP code range  last value ZIP code range       Province
0                           1000                       1200  North-Holland
1                           1201                       1500  South-Holland
2                           1501                       1570  North-Holland
3                           1571                       1600       Den Haag
4                           5000                       6000        Utrecht


# output
  ZIP code Other columns       Province
0   1011AA           ...  North-Holland
1   1011AA           ...  North-Holland
2   2316XH           ...            NaN
3   5815NE           ...        Utrecht

Ensuring the boundaries in df2 are numeric:

df2[['starting value ZIP code range', 'last value ZIP code range']] = \
df2[['starting value ZIP code range', 'last value ZIP code range']].apply(pd.to_numeric, errors='coerce')

edited Feb 28 '23 at 10:30

answered Feb 28 '23 at 10:23

mozway

194,879
13
39
75

I get the following error: `Incompatible merge dtype, dtype('O') and dtype('int32'), both sides must have numeric dtype` – Xtiaan Feb 28 '23 at 10:27
Ensure that the start/last columns in `df2` are numeric. See the update – mozway Feb 28 '23 at 10:30
Almost there, the current error is that we are comparing int32 and int64: `incompatible merge keys [0] dtype('int32') and dtype('int64'), must be the same type`. Can I adjust the `.astype(int)`? – Xtiaan Feb 28 '23 at 10:33
I solved it by using .astype(np.int64) – Xtiaan Feb 28 '23 at 10:39
Could you maybe explain why it is sufficient to only compare the `key` to the last value? I can see that we merge on the starting value, but I am still a little but puzzled – Xtiaan Feb 28 '23 at 10:56
the `merge_asof` ensures that the match is ≥ the start, the `query` ensures than the match is <= the end. Because you have non-overlapping ranges, the `merge_asof` either puts the value in the correct range or overshoots in a non-existing one (which is then removed by the `query`) – mozway Feb 28 '23 at 11:00
So I tried to apply this example to my entire data set and got the following error: `int() argument must be a string, a bytes-like object or a real number, not 'NoneType'`. Could this be caused a NaN values? – Xtiaan Feb 28 '23 at 12:00
Can you provide a reproducible example? You might want to `dropna` on `df2`. – mozway Feb 28 '23 at 12:05
I cannot share the data. I do not mind skipping NaN values, but I do not want to drop them from my data set. Does dropna only skip the NaN values or does it remove them entirely from my data set? – Xtiaan Feb 28 '23 at 12:10
`dropna` doesn't mutate the original dataframe. If the NaNs are in `df2`, use `df2.dropna()` in the `merge_asof` – mozway Feb 28 '23 at 12:17

How to match a partial string to a range of possible values?

1 Answers1

other example