1

I have the following dataframes:

df1:

ZIP code Other columns
1011AA ...
1011AA ...
2316XH ...
5815NE ...

df2:

starting value ZIP code range last value ZIP code range Province
1000 1200 North-Holland
1201 1500 South-Holland
1501 1570 North-Holland
1571 1600 Den Haag

I want to:

  1. Get the first four digits of df1["ZIP code"]
  2. Check if these four digits are present in any range in df2["starting value ZIP code range"] and df["last value ZIP code range"]
  3. If there is a match, get df2["Province"] and add this value to a column in df1.

The difficulty is that I need to compare this to a range of values and I can only use the first 4 digits of the string. Most examples I found on stackoverflow compare based on a single value. The desired result is:

ZIP code New column
1011AA North-Holland
1011AA North-Holland
2316XH Haarlem
5815NE Utrecht

Bonus points if you can do it using map. For example, df1["New column"] = df1["ZIP code"].str[:4].map(... ? ...). However, if the map method is a bad idea please suggest a better method.

Xtiaan
  • 252
  • 1
  • 11
  • `df1['Province'] = pd.merge_asof(df1.assign(key=df1['ZIP code'].str.extract('(\d+)', expand=False).astype(int)), df2, left_on='key', right_on='starting value ZIP code range').query('key <= \`last value ZIP code range\`')['Province']` – mozway Feb 28 '23 at 10:14
  • The query does not look correct. Notice that the the province is not unique in df2. That is, the ranges are split. – Xtiaan Feb 28 '23 at 10:21
  • It shouldn't matter. Let me provide a full answer and then you can test it and give me a counter example if needed – mozway Feb 28 '23 at 10:22
  • Please check the answer below and report any incorrect behavior with a reproducible example – mozway Feb 28 '23 at 10:28

1 Answers1

1

As your ranges are non-overlapping, you can use a merge_asof on the starting boundary and filter its output (for example with query) to ensure it's within the ending boundary:

df1['Province'] = (
 pd.merge_asof(df1.assign(key=df1['ZIP code'].str.extract('(\d+)', expand=False).astype(int)), df2,
               left_on='key', right_on='starting value ZIP code range')
   .query('key <= `last value ZIP code range`')['Province']
)

Output:

  ZIP code Other columns       Province
0   1011AA           ...  North-Holland
1   1011AA           ...  North-Holland
2   2316XH           ...            NaN
3   5815NE           ...            NaN

other example

Let's add one more entry to df2:

# df2
   starting value ZIP code range  last value ZIP code range       Province
0                           1000                       1200  North-Holland
1                           1201                       1500  South-Holland
2                           1501                       1570  North-Holland
3                           1571                       1600       Den Haag
4                           5000                       6000        Utrecht


# output
  ZIP code Other columns       Province
0   1011AA           ...  North-Holland
1   1011AA           ...  North-Holland
2   2316XH           ...            NaN
3   5815NE           ...        Utrecht

Ensuring the boundaries in df2 are numeric:

df2[['starting value ZIP code range', 'last value ZIP code range']] = \
df2[['starting value ZIP code range', 'last value ZIP code range']].apply(pd.to_numeric, errors='coerce')
mozway
  • 194,879
  • 13
  • 39
  • 75
  • I get the following error: `Incompatible merge dtype, dtype('O') and dtype('int32'), both sides must have numeric dtype` – Xtiaan Feb 28 '23 at 10:27
  • Ensure that the start/last columns in `df2` are numeric. See the update – mozway Feb 28 '23 at 10:30
  • Almost there, the current error is that we are comparing int32 and int64: `incompatible merge keys [0] dtype('int32') and dtype('int64'), must be the same type`. Can I adjust the `.astype(int)`? – Xtiaan Feb 28 '23 at 10:33
  • I solved it by using .astype(np.int64) – Xtiaan Feb 28 '23 at 10:39
  • Could you maybe explain why it is sufficient to only compare the `key` to the last value? I can see that we merge on the starting value, but I am still a little but puzzled – Xtiaan Feb 28 '23 at 10:56
  • the `merge_asof` ensures that the match is ≥ the start, the `query` ensures than the match is <= the end. Because you have non-overlapping ranges, the `merge_asof` either puts the value in the correct range or overshoots in a non-existing one (which is then removed by the `query`) – mozway Feb 28 '23 at 11:00
  • So I tried to apply this example to my entire data set and got the following error: `int() argument must be a string, a bytes-like object or a real number, not 'NoneType'`. Could this be caused a NaN values? – Xtiaan Feb 28 '23 at 12:00
  • Can you provide a reproducible example? You might want to `dropna` on `df2`. – mozway Feb 28 '23 at 12:05
  • I cannot share the data. I do not mind skipping NaN values, but I do not want to drop them from my data set. Does dropna only skip the NaN values or does it remove them entirely from my data set? – Xtiaan Feb 28 '23 at 12:10
  • `dropna` doesn't mutate the original dataframe. If the NaNs are in `df2`, use `df2.dropna()` in the `merge_asof` – mozway Feb 28 '23 at 12:17