4

I have a table from here: https://deepnote.com/project/vacunacion-en-Espana-vs-covid-19-UhxqL4bdSfGJjeyb1WDM6A/%2Fnotebook.ipynb. That's a jupyter notebook which downloads every day of vaccination data in Spain and morphs them into different evolution tables. The specific table is this:

Day admin doses complete dosis
210104 82834 0
210107 207323 0
210108 277976 0
210111 406091 0
210112 488041 0
210113 581638 0
210114 676186 0
210115 768950 0
210118 897942 4630
210119 966097 18682
210120 1025937 31284
210121 1103301 98112
210122 1165825 136912
210125 1237593 177396
210126 1291216 247394
210127 1356461 346132
210128 1395618 385518
210129 1474189 503732
210201 1609261 715784
210202 1673054 837038
210203 1764778 997956
210204 1865342 1172244
210205 1988160 1365818
210208 2105033 1572814
210209 2167241 1677564
210210 2233249 1779366
210211 2320507 1886556
210212 2423045 2000970
210215 2561608 2140182
210216 2624512 2193844
210217 2690457 2238360
210218 2782751 2289112
210219 2936011 2342052
210222 3090351 2394122
210223 3165191 2416610
210224 3284678 2442366
210225 3436158 2463564
210226 3605635 2487566
210301 3829465 2523696
210302 3908734 2533400
210303 4059320 2574004
210304 4229092 2617826
210305 4471577 2684970
210308 4712191 2766976
210309 4848648 2824472
210310 4989505 2895676
210311 5172106 3006516
210312 5352767 3166488
210315 5644895 3387348
210316 5742218 3457074
210317 5857085 3609230
210318 5993363 3773626
210322 6321908 4221736
210323 6409196 4313196
210324 6620093 4552466

And I would like to know what is the nearest value of each complete doses compared with the entire admin doses, in order to know how many days were needed for completing a vaccination period. For instance, at 210129 there were 503732 complete doses, and the nearest value is 488041 admin doses at 210112 - thus 17 days passed between the admin of 488041 doses and 503732 doses.

I tried a good number of options, but nothing works either with raw pandas or with pandas and numpy. In Excel it would be

{=INDEX(A$2:A$56;MATCH(MIN(ABS(B$2:B$56-C7));ABS(B$2:B$56-C7);0))}

But I haven't been able to translate it into Pandas

Thanks in advance for any help

  • Why are we using C7? If I load this into excel, C7 would be `0`. For your example, I assume we will look at C18? – Joe Ferndz Mar 25 '21 at 17:24
  • also see [this answer](https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array) for the numpy version of this question. – Umar.H Mar 25 '21 at 17:34

2 Answers2

1

To find such matches you can use merge_asof:

result = pd.merge_asof(df, df.sort_values('admin doses'), left_on='complete dosis', right_on='admin doses')

This will only set you with values, and from there you can move to calculating what you need.

zipa
  • 27,316
  • 6
  • 40
  • 58
1
def find_nearest(series):
    current_idx = series.loc["index"]
    nearest_idx = np.abs(df["admin doses"] - series.loc["complete dosis"]).argmin()
    day_diff = (days_in_dt[current_idx] - days_in_dt[nearest_idx]).days
    return day_diff

# convert Day column to TimeStamps
days_in_dt = pd.to_datetime(df.Day, format="%y%m%d")

# the result
df["complete dosis"].reset_index().apply(find_nearest, axis=1)
0      0
1      3
2      4
3      7
4      8
5      9
6     10
7     11
8     14
9     15
10    16
11    17
12    18
13    18
14    18
15    16
16    17
17    17
18    18
19    15
20    14
21    13
22     9
23     7
24     7
25     7
26     7
27     7
28     6
29     7
30     7
31     7
32     8
33    10
34    11
35    12
36    13
37    14
38    14
39    15
40    16
41    16
42    16
43    18
44    19
45    19
46    20
47    17
48    18
49    19
50    19
51    17
52    18
53    19
54    19
dtype: int64

We apply a function that finds out the day difference to the nearest value. The values passed to the function are of the form:

index                8
complete dosis    4630
Name: 8, dtype: int64

this is so that we can also get their index value and take the difference later. reset_index prior to apply gives this index information to the series passed.

Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
  • Thanks! now, when I tested it, this error appears: `TypeError: reduction operation 'argmin' not allowed for this dtype` It is supposed that this error message appears when argmin is used with no numeric values, but this is not the case AFAIK Where is the problem? Thanks again in advance – Juan Luis Chulilla Mar 25 '21 at 22:39
  • @JuanLuisChulilla Perhaps try `df[["admin doses", "complete dosis"]] = df[["admin doses", "complete dosis"]].astype(int)` before applying the function to make sure they are integers? – Mustafa Aydın Mar 26 '21 at 05:33
  • thanks a lot again!! it works like a charm. Now, what I'm struggle to understand is how `series.loc` points to each single element of "complete dosis". It's obvious that df.column points to an entire column, but after reading pandas documentation I haven't figure out how `series.loc` works in this case. – Juan Luis Chulilla Mar 26 '21 at 11:34
  • @JuanLuisChulilla We use `.apply` on a data frame over the `axis=1`. (`df["complete dosis"]` itself is a series yes, but we `reset_index` on it and it becomes a data frame; you can run it seperately to see its behaviour). `axis=1` means pandas sends *rows* of this data frame one by one to the function `find_nearest`. As stated in the very last part of the answer, function sees a series in that form each time i.e. column names of the data frame becomes the index names. Then `.loc` is used to access the values corresponding to them e.g. `series.loc["index"] is 8 for the last example and so on. – Mustafa Aydın Mar 26 '21 at 12:27
  • 1
    Thanks again!! Muchas gracias!! Sukran!! – Juan Luis Chulilla Mar 27 '21 at 01:34