I have a list of valid dates and need to check a column in my dataframe (dfPA) against this list. If the date is not in the list, I replace it with the nearest date from the list. I've written the following code, which works:
def nearest(items, pivot):
"""
https://stackoverflow.com/a/65670963/12480198
"""
if pivot in items:
return pivot
else:
return min(items, key=lambda x: abs(x - pivot))
df = dfPA[~dfPA[subdtcol].isin(ldates)]
df[subdtcol] = pd.to_datetime(df[subdtcol], errors='coerce')
for i, row in df.iterrows():
dt = dfPA.loc[i, subdtcol]
dfPA.loc[i, subdtcol] = nearest(ldates, dt)
this is what ldates looks like:
[datetime.datetime(1990, 1, 1, 0, 0),
datetime.datetime(1990, 6, 6, 0, 0),
datetime.datetime(1776, 4, 7, 0, 0),
datetime.datetime(2012, 1, 2, 0, 0),
datetime.datetime(2012, 1, 16, 0, 0),
datetime.datetime(2012, 2, 6, 0, 0),
datetime.datetime(2012, 2, 20, 0, 0),
datetime.datetime(2012, 3, 5, 0, 0),
datetime.datetime(2012, 3, 19, 0, 0),...]
rewritten as follows, using np.where didn't work:
dfPA[subdtcol] = np.where(dfPA[subdtcol].isin(ldates), dfPA[subdtcol],
nearest(ldates, dfPA[subdtcol]))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What am I doing wrong and is there value (in terms of speed) in using np.where instead of a for loop? Thanks.