How can I find the nearest value of a pandas element compared with another entire column?

Question

I have a table from here: https://deepnote.com/project/vacunacion-en-Espana-vs-covid-19-UhxqL4bdSfGJjeyb1WDM6A/%2Fnotebook.ipynb. That's a jupyter notebook which downloads every day of vaccination data in Spain and morphs them into different evolution tables. The specific table is this:

Day	admin doses	complete dosis
210104	82834	0
210107	207323	0
210108	277976	0
210111	406091	0
210112	488041	0
210113	581638	0
210114	676186	0
210115	768950	0
210118	897942	4630
210119	966097	18682
210120	1025937	31284
210121	1103301	98112
210122	1165825	136912
210125	1237593	177396
210126	1291216	247394
210127	1356461	346132
210128	1395618	385518
210129	1474189	503732
210201	1609261	715784
210202	1673054	837038
210203	1764778	997956
210204	1865342	1172244
210205	1988160	1365818
210208	2105033	1572814
210209	2167241	1677564
210210	2233249	1779366
210211	2320507	1886556
210212	2423045	2000970
210215	2561608	2140182
210216	2624512	2193844
210217	2690457	2238360
210218	2782751	2289112
210219	2936011	2342052
210222	3090351	2394122
210223	3165191	2416610
210224	3284678	2442366
210225	3436158	2463564
210226	3605635	2487566
210301	3829465	2523696
210302	3908734	2533400
210303	4059320	2574004
210304	4229092	2617826
210305	4471577	2684970
210308	4712191	2766976
210309	4848648	2824472
210310	4989505	2895676
210311	5172106	3006516
210312	5352767	3166488
210315	5644895	3387348
210316	5742218	3457074
210317	5857085	3609230
210318	5993363	3773626
210322	6321908	4221736
210323	6409196	4313196
210324	6620093	4552466

And I would like to know what is the nearest value of each complete doses compared with the entire admin doses, in order to know how many days were needed for completing a vaccination period. For instance, at 210129 there were 503732 complete doses, and the nearest value is 488041 admin doses at 210112 - thus 17 days passed between the admin of 488041 doses and 503732 doses.

I tried a good number of options, but nothing works either with raw pandas or with pandas and numpy. In Excel it would be

{=INDEX(A$2:A$56;MATCH(MIN(ABS(B$2:B$56-C7));ABS(B$2:B$56-C7);0))}

But I haven't been able to translate it into Pandas

Thanks in advance for any help

Why are we using C7? If I load this into excel, C7 would be `0`. For your example, I assume we will look at C18? — Joe Ferndz, Mar 25 '21 at 17:24
also see [this answer](https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array) for the numpy version of this question. — Umar.H, Mar 25 '21 at 17:34

score 1 · Answer 1 · answered Mar 25 '21 at 17:30

1

To find such matches you can use merge_asof:

result = pd.merge_asof(df, df.sort_values('admin doses'), left_on='complete dosis', right_on='admin doses')

This will only set you with values, and from there you can move to calculating what you need.

answered Mar 25 '21 at 17:30

zipa

27,316
6
40
58

score 1 · Accepted Answer · answered Mar 25 '21 at 18:05

1

def find_nearest(series):
    current_idx = series.loc["index"]
    nearest_idx = np.abs(df["admin doses"] - series.loc["complete dosis"]).argmin()
    day_diff = (days_in_dt[current_idx] - days_in_dt[nearest_idx]).days
    return day_diff

# convert Day column to TimeStamps
days_in_dt = pd.to_datetime(df.Day, format="%y%m%d")

# the result
df["complete dosis"].reset_index().apply(find_nearest, axis=1)

0      0
1      3
2      4
3      7
4      8
5      9
6     10
7     11
8     14
9     15
10    16
11    17
12    18
13    18
14    18
15    16
16    17
17    17
18    18
19    15
20    14
21    13
22     9
23     7
24     7
25     7
26     7
27     7
28     6
29     7
30     7
31     7
32     8
33    10
34    11
35    12
36    13
37    14
38    14
39    15
40    16
41    16
42    16
43    18
44    19
45    19
46    20
47    17
48    18
49    19
50    19
51    17
52    18
53    19
54    19
dtype: int64

We apply a function that finds out the day difference to the nearest value. The values passed to the function are of the form:

index                8
complete dosis    4630
Name: 8, dtype: int64

this is so that we can also get their index value and take the difference later. reset_index prior to apply gives this index information to the series passed.

answered Mar 25 '21 at 18:05

Mustafa Aydın

17,645
4
15
38

Thanks! now, when I tested it, this error appears: `TypeError: reduction operation 'argmin' not allowed for this dtype` It is supposed that this error message appears when argmin is used with no numeric values, but this is not the case AFAIK Where is the problem? Thanks again in advance – Juan Luis Chulilla Mar 25 '21 at 22:39
@JuanLuisChulilla Perhaps try `df[["admin doses", "complete dosis"]] = df[["admin doses", "complete dosis"]].astype(int)` before applying the function to make sure they are integers? – Mustafa Aydın Mar 26 '21 at 05:33
thanks a lot again!! it works like a charm. Now, what I'm struggle to understand is how `series.loc` points to each single element of "complete dosis". It's obvious that df.column points to an entire column, but after reading pandas documentation I haven't figure out how `series.loc` works in this case. – Juan Luis Chulilla Mar 26 '21 at 11:34
@JuanLuisChulilla We use `.apply` on a data frame over the `axis=1`. (`df["complete dosis"]` itself is a series yes, but we `reset_index` on it and it becomes a data frame; you can run it seperately to see its behaviour). `axis=1` means pandas sends *rows* of this data frame one by one to the function `find_nearest`. As stated in the very last part of the answer, function sees a series in that form each time i.e. column names of the data frame becomes the index names. Then `.loc` is used to access the values corresponding to them e.g. `series.loc["index"] is 8 for the last example and so on. – Mustafa Aydın Mar 26 '21 at 12:27
1

Thanks again!! Muchas gracias!! Sukran!! – Juan Luis Chulilla Mar 27 '21 at 01:34

How can I find the nearest value of a pandas element compared with another entire column?

2 Answers2