I am trying to use dateparser to parse dates with years earlier than 1000, i.e. years with fewer than four digits.

import dateparser

value = "july 900"
result = dateparser.parse(value)
result is None  # True

At first I thought this was related to the problem described in "Use datetime.strftime() on years before 1900?" ("require year >= 1900"), because with certain inputs (such as just 900) the result was the current day and month combined with the year 1900. But after more trials with random dates and relative expressions, I noticed that dateparser can output dates earlier than 1000, and then I figured out that if I zero-pad the year to four digits, the result is correct.

import dateparser

value = "july 0900"
result = dateparser.parse(value)
result is None  # False
result  # datetime.datetime(900, 7, 4, 0, 0)

While searching for a solution I found this issue: https://github.com/scrapinghub/dateparser/issues/410, but its final comment left me with more questions than answers, as I have failed to find a way to pass a custom parser to dateparser's internal use of dateutil.parser.

My current workaround is to look for three-digit year patterns with a regex similar to this: (.* +| *|.+[\/\-.]{1,})([1-9][0-9]{2,})( *| +.*|[\/\-.]{1,}.+) and zero-pad them in place before calling dateparser.
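The padding workaround described above could be sketched roughly as follows. This is a simplified illustration, not the exact regex quoted above: `pad_short_years` is a hypothetical helper name, and it assumes that any standalone three-digit number not starting with 0 is a year, which will not hold for every input.

```python
import re

# Assumption: a standalone run of exactly three digits (not starting with 0)
# is a year. The lookarounds keep it from matching inside longer numbers
# such as 1900 or an already padded 0900.
_SHORT_YEAR = re.compile(r"(?<!\d)([1-9]\d{2})(?!\d)")


def pad_short_years(text):
    """Zero-pad three-digit years to four digits (hypothetical helper)."""
    return _SHORT_YEAR.sub(lambda m: m.group(1).zfill(4), text)
```

The padded string can then be handed to `dateparser.parse` as in the second snippet above. Because already padded or four-digit years are left alone, applying the helper twice gives the same result as applying it once.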

Is there a better way to do this?

EDIT:

Is there also an elegant way to parse dates before the Common Era (e.g. BC)? The dateparser settings key SUPPORT_BEFORE_COMMON_ERA doesn't seem to do much in this regard, and all the other settings seem to be unrelated.

This is for an archaeological dating site.

No Mad

1 Answer

Don't use regular expressions with dates. It is hard and the corner cases will drive you nuts. The module dateutil does what you want correctly.

>>> from dateutil import parser
>>> value = "july 900"
>>> parser.parse(value)
datetime.datetime(900, 7, 4, 0, 0)

This is not a solution for dates before the current era. That is because dateutil and dateparser both work with datetimes and datetimes don't accept years less than 1.
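As a small check of that limit, here is a sketch using only the standard library. `fits_in_datetime` is a hypothetical helper name introduced for illustration; `datetime.MINYEAR` is the smallest year a `datetime` object accepts.

```python
import datetime


def fits_in_datetime(year):
    """Return True if `year` can be stored in a datetime (hypothetical helper)."""
    try:
        datetime.datetime(year, 1, 1)
    except ValueError:
        # Raised for any year outside datetime.MINYEAR..datetime.MAXYEAR.
        return False
    return True


# Year 900 is representable; year 0 and negative (BC) years are not.
results = {year: fits_in_datetime(year) for year in (900, 1, 0, -44)}
```

So any BC support would have to store the year outside a `datetime` object, for example as a separate signed integer.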

BoarGules
  • Yes, that would solve part of the problem, but I do need to use the `dateparser` library, which doesn't seem to provide hooks for me to meddle with it. – No Mad Apr 04 '19 at 15:10
  • In that case your best bet would be to report this issue as a bug to the `dateparser` maintainers. The module claims to parse "localized dates in almost any string formats commonly found on web pages" and I think "July 900" should qualify. – BoarGules Apr 04 '19 at 15:12
  • That seems the most reasonable course of action, and I will do it. Though I was hoping someone else had encountered this problem and found a solution outside the library's code. – No Mad Apr 05 '19 at 07:24