0

I have a list of valid dates and need to check a column in my dataframe (dfPA) against this list. If the date is not in the list, I replace it with the nearest date from the list. I've written the following code, which works:

def nearest(items, pivot):
    """
    https://stackoverflow.com/a/65670963/12480198
    """
    if pivot in items:
        return pivot
    else:
        return min(items, key=lambda x: abs(x - pivot))

df = dfPA[~dfPA[subdtcol].isin(ldates)]
df[subdtcol] = pd.to_datetime(df[subdtcol], errors='coerce')
for i, row in df.iterrows():
    dt = dfPA.loc[i, subdtcol]
    dfPA.loc[i, subdtcol] = nearest(ldates, dt)

this is what ldates looks like:

[datetime.datetime(1990, 1, 1, 0, 0),
 datetime.datetime(1990, 6, 6, 0, 0),
 datetime.datetime(1776, 4, 7, 0, 0),
 datetime.datetime(2012, 1, 2, 0, 0),
 datetime.datetime(2012, 1, 16, 0, 0),
 datetime.datetime(2012, 2, 6, 0, 0),
 datetime.datetime(2012, 2, 20, 0, 0),
 datetime.datetime(2012, 3, 5, 0, 0),
 datetime.datetime(2012, 3, 19, 0, 0),...]

rewritten as follows, using np.where didn't work:

dfPA[subdtcol] = np.where(dfPA[subdtcol].isin(ldates), dfPA[subdtcol],
                          nearest(ldates, dfPA[subdtcol]))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What am I doing wrong and is there value (in terms of speed) in using np.where instead of a for loop? Thanks.

SModi
  • 125
  • 14
  • 1
    we do not know what dfPA is… – mozway Jul 26 '21 at 12:27
  • it's a big dataframe with some date columns. One of which needs the above check. – SModi Jul 26 '21 at 12:29
  • is `nearest` a function you wrote yourself? if yes you need to rewrite it so that it can accept a series/dataframe `dfPA[subdtcol]` instead of a single date `dt`. – Ben.T Jul 26 '21 at 12:49
  • hi @Ben.T, have added the nearest function to my question. Can't take credit for it. Shared link from where I got it. I'm don't understand lambda functions well. How would I revise it to accept a series? – SModi Jul 26 '21 at 12:57

1 Answers1

0

What I understand is that at the end there are only dates on the list.

>>> ldates
[datetime.datetime(2012, 1, 2, 0, 0),
 datetime.datetime(2012, 1, 11, 0, 0),
 datetime.datetime(2012, 1, 16, 0, 0)]

>>> df
        date  # result should be:
0 2012-01-01  # 2012-01-02
1 2012-01-04  # 2012-01-02
2 2012-01-07  # 2012-01-11
3 2012-01-10  # 2012-01-11
4 2012-01-13  # 2012-01-11
5 2012-01-16  # 2012-01-16
6 2012-01-19  # 2012-01-16

First, we need to insert dates from list into the dataframe

df.loc[np.searchsorted(df['date'], ldates), 'date2'] = ldates
print(df)

        date      date2
0 2012-01-01        NaT
1 2012-01-04 2012-01-02
2 2012-01-07        NaT  # <- nearest is 2012-01-11
3 2012-01-10        NaT
4 2012-01-13 2012-01-11
5 2012-01-16 2012-01-16
6 2012-01-19        NaT

Now, fill NaT using ffill/bfill and bfill/ffill:

df['date2_ff'] = df['date2'].ffill().bfill()
df['date2_bf'] = df['date2'].bfill().ffill()
print(df)

        date      date2   date2_ff   date2_bf
0 2012-01-01        NaT 2012-01-02 2012-01-02
1 2012-01-04 2012-01-02 2012-01-02 2012-01-02
2 2012-01-07        NaT 2012-01-02 2012-01-11
3 2012-01-10        NaT 2012-01-02 2012-01-11
4 2012-01-13 2012-01-11 2012-01-11 2012-01-11
5 2012-01-16 2012-01-16 2012-01-16 2012-01-16
6 2012-01-19        NaT 2012-01-16 2012-01-16

Finally, compute the difference and keep the minimum value with np.where:

df['new_date'] = np.where(df['date2_ff'].sub(df['date']).abs()
                          <= df['date2_bf'].sub(df['date']).abs(),
                          df['date2_ff'], df['date2_bf'])
print(df)

        date      date2   date2_ff   date2_bf   new_date
0 2012-01-01        NaT 2012-01-02 2012-01-02 2012-01-02
1 2012-01-04 2012-01-02 2012-01-02 2012-01-02 2012-01-02
2 2012-01-07        NaT 2012-01-02 2012-01-11 2012-01-11
3 2012-01-10        NaT 2012-01-02 2012-01-11 2012-01-11
4 2012-01-13 2012-01-11 2012-01-11 2012-01-11 2012-01-11
5 2012-01-16 2012-01-16 2012-01-16 2012-01-16 2012-01-16
6 2012-01-19        NaT 2012-01-16 2012-01-16 2012-01-16

In a function:

def nearest(sr, l):
    sr2 = pd.Series(l, index=np.searchsorted(sr, l)).reindex(sr.index)
    ff = sr2.ffill().bfill()
    bf = sr2.bfill().ffill()
    return np.where(ff.sub(sr).abs() <= bf.sub(sr).abs(), ff, bf)

df['new_date'] = nearest(df['date'], l)
>>> df
        date   new_date
0 2012-01-01 2012-01-02
1 2012-01-04 2012-01-02
2 2012-01-07 2012-01-11
3 2012-01-10 2012-01-11
4 2012-01-13 2012-01-11
5 2012-01-16 2012-01-16
6 2012-01-19 2012-01-16
Corralien
  • 109,409
  • 8
  • 28
  • 52