I am trying to use dateparser to parse dates with years earlier than 1000, written with fewer than four digits.
import dateparser
value = "july 900"
result = dateparser.parse(value)
result is None # True
At first I thought it was related to the problem mentioned here: Use datetime.strftime() on years before 1900? ("require year >= 1900"), because with certain inputs (like just 900) the result was the current day and month combined with the year 1900.
But after some more trials with random dates and relative expressions, I noticed that dateparser can output dates earlier than 1000. I then figured out that if I zero-pad the year, the result is correct.
import dateparser
value = "july 0900"
result = dateparser.parse(value)
result is None # False
result # datetime.datetime(900, 7, 4, 0, 0)
I have found this in my search for a solution:
https://github.com/scrapinghub/dateparser/issues/410
but the final comment left me with more questions than answers, as I have failed to find a way to pass a custom parser to dateparser's internal use of dateutil.parser.
My current solution is to look for three-digit year patterns with a regex, using something similar to this: (.* +| *|.+[\/\-.]{1,})([1-9][0-9]{2,})( *| +.*|[\/\-.]{1,}.+) and pad them in place.
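To make the workaround concrete, here is a simplified sketch of the padding step. It is not the exact regex above: it assumes that any standalone run of exactly three digits is a year, which is not true for every input. The helper name pad_three_digit_years is made up for this example and is not part of dateparser.

```python
import re

# Assumption: a standalone run of exactly three digits (not starting with 0,
# and not adjacent to other digits) is a year that needs zero-padding.
THREE_DIGIT_YEAR = re.compile(r"(?<!\d)([1-9]\d{2})(?!\d)")


def pad_three_digit_years(text: str) -> str:
    """Zero-pad every standalone three-digit number to four digits."""
    return THREE_DIGIT_YEAR.sub(r"0\g<1>", text)


if __name__ == "__main__":
    for sample in ("july 900", "4/7/900", "july 1900", "july 0900"):
        print(repr(sample), "->", repr(pad_three_digit_years(sample)))
```

The padded string is then what I pass to dateparser.parse. Four-digit years and already padded years are left untouched, so the function can be applied more than once safely.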
Is there a better way to do this?
EDIT:
Is there also an elegant solution to parse dates before our era (e.g. BC)? (It seems that the dateparser settings key SUPPORT_BEFORE_COMMON_ERA doesn't do much in this regard, and all the others seem to be unrelated.)
This is for use on an archeological dating site.
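For the BC case, the only fallback I can think of is to handle the era marker myself, since Python's datetime cannot represent years below 1 (datetime.MINYEAR is 1). A minimal sketch, assuming the era is written as a plain "BC" or "BCE" token; split_era is a hypothetical helper of mine, not a dateparser feature, and forms such as "B.C." are not handled:

```python
import re

# Assumption: the era is written as a standalone "BC" or "BCE" token
# (case-insensitive). Dotted forms like "B.C." are deliberately not covered.
ERA_MARKER = re.compile(r"\s*\b(?:BCE|BC)\b", re.IGNORECASE)


def split_era(text: str) -> tuple[str, bool]:
    """Return the text without its BC/BCE marker, plus a before-common-era flag."""
    cleaned, count = ERA_MARKER.subn("", text)
    return cleaned.strip(), count > 0


if __name__ == "__main__":
    for sample in ("july 0900 BC", "300 bce", "july 0900"):
        print(repr(sample), "->", split_era(sample))
```

The cleaned string could then be parsed as usual, with the flag stored alongside the result, but that means carrying the era outside the datetime object, which is why I am hoping for something more elegant.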