Rik Poggi is right: the string 'Июнь' cannot be a month for python-dateutil. Digging a little into dateutil/parser.py, the basic problem is that this module is only internationalised enough to handle Western European languages written in Latin script. It is not designed to handle languages such as Russian that use non-Latin scripts such as Cyrillic.
The biggest obstacle is in dateutil/parser.py:45-48, where the lexical analyser class _timelex defines the characters which can be used in tokens, including month and day names:
    class _timelex(object):
        def __init__(self, instream):
            # ... [some material omitted] ...
            self.wordchars = ('abcdfeghijklmnopqrstuvwxyz'
                              'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
                              'ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
                              'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')
            self.numchars = '0123456789'
            self.whitespace = ' \t\r\n'
Because wordchars does not include Cyrillic letters, _timelex emits each byte in the date string as a separate token. This is what Rik observed.
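The effect can be seen in isolation with a minimal sketch in Python 3. The split_tokens function below is a hypothetical, greatly simplified stand-in for _timelex, not dateutil's actual lexer: it only groups consecutive characters found in wordchars and emits anything else as a token of its own. Because Python 3 strings are Unicode, the split here happens per character; with byte strings it would happen per byte instead.

```python
# Simplified illustration only. split_tokens is a hypothetical stand-in
# for _timelex and ignores numbers, whitespace and everything else the
# real lexer handles.
WORDCHARS = ('abcdfeghijklmnopqrstuvwxyz'
             'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
             'ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
             'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')


def split_tokens(text, wordchars=WORDCHARS):
    """Group runs of word characters; emit every other character alone."""
    tokens, current = [], ''
    for ch in text:
        if ch in wordchars:
            current += ch
        else:
            if current:
                tokens.append(current)
                current = ''
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


english = split_tokens('June')
russian = split_tokens('Июнь')
print(english)  # the Latin-script name stays in one piece
print(russian)  # the Cyrillic name is broken into single characters
```

A month lookup that expects a whole word such as 'June' has nothing to match against once the name has been broken up this way.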
Another large obstacle is that dateutil uses Python byte strings instead of Unicode strings internally for all of its processing. This means that even if _timelex were extended to accept Cyrillic letters, there would still be mismatches between the handling of bytes and of characters, and problems caused by differences in string encoding between the caller and the python-dateutil source code.
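The byte/character mismatch is easy to demonstrate without dateutil at all. The following Python 3 sketch only uses built-in string methods; the choice of UTF-8 and cp1251 as the two encodings is an assumption made for illustration, not something taken from the question.

```python
month = 'Июнь'

# The same month name, encoded two different ways.
utf8_bytes = month.encode('utf-8')
cp1251_bytes = month.encode('cp1251')

print(len(month))         # number of characters
print(len(utf8_bytes))    # number of bytes in UTF-8
print(len(cp1251_bytes))  # number of bytes in cp1251

# A byte-wise comparison fails if caller and library disagree on encoding.
same_bytes = (utf8_bytes == cp1251_bytes)

# Taking "the first three characters" of a byte string can cut a
# multi-byte UTF-8 character in half, leaving bytes that cannot be decoded.
try:
    utf8_bytes[:3].decode('utf-8')
    prefix_decodes = True
except UnicodeDecodeError:
    prefix_decodes = False

print(same_bytes, prefix_decodes)
```

Any code that compares, slices or measures month names as bytes inherits these problems, which is why extending wordchars alone would not be enough.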
There are other minor issues, such as an assumption that every month name is at least 3 characters long (not true for Japanese), and many details related to the Gregorian calendar. It would be helpful for the wordchars field to be picked up from parserinfo if present, so that parserinfo could define the right set of characters for its month and day names.
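One way the wordchars suggestion might look is sketched below in Python 3. Everything here is hypothetical: RussianInfo is a plain class that merely resembles a parserinfo, and wordchars_for is an invented helper. dateutil does not provide either, and its parserinfo has no wordchars attribute as far as this answer is concerned.

```python
import string

# Fallback used when the info object says nothing about word characters.
DEFAULT_WORDCHARS = string.ascii_letters + '_'

RUSSIAN_MONTHS = ['Январь', 'Февраль', 'Март', 'Апрель', 'Май', 'Июнь',
                  'Июль', 'Август', 'Сентябрь', 'Октябрь', 'Ноябрь',
                  'Декабрь']


class RussianInfo:
    """Hypothetical parserinfo-like object (not dateutil's parserinfo)."""
    MONTHS = RUSSIAN_MONTHS
    # Every character occurring in a month name, in both cases.
    wordchars = ''.join(sorted({c for name in RUSSIAN_MONTHS
                                for c in name.lower() + name.upper()}))


def wordchars_for(info=None):
    """Use the info object's wordchars if it defines them, else the default."""
    return getattr(info, 'wordchars', DEFAULT_WORDCHARS)


print(all(c in wordchars_for(RussianInfo) for c in 'Июнь'))
print(all(c in wordchars_for() for c in 'Июнь'))
```

The design point is that the lexer would stop hard-coding an alphabet and instead ask the locale-specific object, which already knows the month and day names, for the characters those names use.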
python-dateutil v2.0 has been ported to Python 3, but the above design problems aren't significantly changed. The differences between 2.0 and 1.5 are there to handle Python language changes, not to change dateutil's design and data structures.
Oleg, you were able to modify parserinfo, and I suspect you succeeded because your test code didn't use the parser() (and _timelex) of python-dateutil. In essence, you supplied your own parser and lexer.
Correcting this problem would require fairly major improvements to the text handling of python-dateutil. It would be great if someone were to make a patch with that change, and the package maintainers were able to incorporate it.