I have a pandas DataFrame with multiple columns of strings representing dates, with empty strings representing missing dates. For example:
import numpy as np
import pandas as pd
# expected date format is '%m/%d/%Y'
custId = np.array(list(range(1,6)))
eventDate = np.array(["06/10/1992","08/24/2012","04/24/2015","","10/14/2009"])
registerDate = np.array(["06/08/2002","08/20/2012","04/20/2015","","10/10/2009"])
# both date columns of dfGood should convert to datetime without error
dfGood = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate})
I am trying to:
- Efficiently convert columns where all strings are valid dates or empty into columns of type datetime64 (with NaT for the empty)
- Raise ValueError when any non-empty string does not conform to the expected format

Example of where ValueError should be raised:
# 2nd string invalid
registerDate = np.array(["06/08/2002","20/08/2012","04/20/2015","","10/10/2009"])
# eventDate column should convert, registerDate column should raise ValueError
dfBad = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate})
This function does what I want at the element level:
from datetime import datetime
def parseStrToDt(s, format='%m/%d/%Y'):
    """Parse a string to datetime with the supplied format."""
    return pd.NaT if s == '' else datetime.strptime(s, format)
print(parseStrToDt("")) # correctly returns NaT
print(parseStrToDt("12/31/2011")) # correctly returns 2011-12-31 00:00:00
print(parseStrToDt("12/31/11")) # correctly raises ValueError
However, I have read that string operations shouldn't be np.vectorize-d. I thought this could be done efficiently using pandas.DataFrame.apply, as in:
dfGood[['eventDate','registerDate']].applymap(lambda s: parseStrToDt(s)) # raises TypeError
dfGood.loc[:,'eventDate'].apply(lambda s: parseStrToDt(s)) # raises same TypeError
I'm guessing that the TypeError has something to do with my function returning a different dtype, but I do want to take advantage of dynamic typing and replace the string with a datetime (unless ValueError is raised). How can I do this?
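For reference, one approach that seems to fit both requirements is to skip the element-wise function and call pd.to_datetime on each column with an explicit format and errors="raise". This is only a sketch. It assumes pd.to_datetime maps empty strings to NaT and raises ValueError for any non-empty string that does not match the format, which is my understanding of its behaviour. convert_date_columns and the snake_case variable names are hypothetical and not part of pandas or the code above.

```python
import numpy as np
import pandas as pd


def convert_date_columns(df, columns, fmt="%m/%d/%Y"):
    """Return a copy of df with the given string columns parsed as datetimes.

    Hypothetical helper: empty strings are expected to become NaT, and any
    non-empty string that does not match fmt is expected to raise ValueError.
    """
    out = df.copy()
    for col in columns:
        # errors="raise" asks pandas to fail instead of coercing bad strings
        out[col] = pd.to_datetime(out[col], format=fmt, errors="raise")
    return out


# Same sample data as in the question
cust_id = np.arange(1, 6)
event_date = np.array(["06/10/1992", "08/24/2012", "04/24/2015", "", "10/14/2009"])
good_register = np.array(["06/08/2002", "08/20/2012", "04/20/2015", "", "10/10/2009"])
bad_register = np.array(["06/08/2002", "20/08/2012", "04/20/2015", "", "10/10/2009"])

df_good = pd.DataFrame(
    {"custId": cust_id, "eventDate": event_date, "registerDate": good_register}
)
df_bad = pd.DataFrame(
    {"custId": cust_id, "eventDate": event_date, "registerDate": bad_register}
)

# Both date columns of df_good should convert, with NaT for the empty strings
converted = convert_date_columns(df_good, ["eventDate", "registerDate"])
print(converted.dtypes)

# registerDate of df_bad contains "20/08/2012", which does not fit %m/%d/%Y
try:
    convert_date_columns(df_bad, ["eventDate", "registerDate"])
except ValueError as exc:
    print("ValueError:", exc)
```

Because the helper works on a copy, a failure in one column leaves the original DataFrame untouched. Converting column by column also keeps the parsing inside pandas rather than in a Python-level loop over elements.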