How to use python_dateutil 1.5 'parse' function to work with unicode?

Question

I need python_dateutil 1.5's parse() to work with Unicode month names.

With fuzzy=True it skips the month name and produces a result with month = 1.

Without the fuzzy parameter I get the following exception:

from dateutil.parser import parserinfo, parser, parse

class myparserinfo(parserinfo):
    MONTHS = parserinfo.MONTHS[:]
    MONTHS[3] = (u"Foo", u"Foo", u"Июнь")


>>> test = unicode('8th of Июнь', 'utf-8')
>>> tester = parse(test, parserinfo=myparserinfo())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Python27\lib\site-packages\python_dateutil-1.5-py2.7.egg\dateutil\parser.py", line 695, in parse
    return parser(parserinfo).parse(timestr, **kwargs)
  File "C:\Python27\lib\site-packages\python_dateutil-1.5-py2.7.egg\dateutil\parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

Re: "Try encoding as UTF-8." test = test.encode("utf-8") does not help. — Oleg Dats, Jan 17 '12 at 15:23
I think your use of the term "unicode" is misleading in this question. I think you mean "Russian" instead. I think you are asking how to get python-dateutil 1.5 to parse dates written in Russian. Adding the tag "internationalization" will get your question more visibility to people who know about this. — Jim DeLaHunt, Jan 17 '12 at 18:30
Not an answer, but as an alternative you could consider using PyICU http://pyicu.osafoundation.org/ which uses up to date CLDR data. — Steven R. Loomis, Jan 18 '12 at 23:12

score 8 · Accepted Answer · answered Jan 18 '12 at 21:16

Rik Poggi is right: the string 'Июнь' cannot be a month for python-dateutil. Digging a little into dateutil/parser.py, the basic problem is that this module is only internationalised enough to handle Western European Latin-script languages. It is not designed to handle languages, such as Russian, that use non-Latin scripts, such as Cyrillic.

The biggest obstacle is in dateutil/parser.py:45-48, where the lexical analyser class _timelex defines the characters which can be used in tokens, including month and day names:

class _timelex(object):
    def __init__(self, instream):
        # ... [some material omitted] ...
        self.wordchars = ('abcdfeghijklmnopqrstuvwxyz'
                          'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
                          'ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
                          'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')
        self.numchars = '0123456789'
        self.whitespace = ' \t\r\n'

Because wordchars does not include Cyrillic letters, _timelex emits each byte in the date string as a separate character. This is what Rik observed.
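As a rough illustration of that point, the sketch below copies the wordchars string from the excerpt above and reports which letters of an input fall outside it. The helper name unsupported_letters is made up for this example and is not part of dateutil. The code is written for Python 3, where every str is Unicode.

```python
# Character set copied verbatim from the _timelex excerpt above.
WORDCHARS = ('abcdfeghijklmnopqrstuvwxyz'
             'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
             'ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
             'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')


def unsupported_letters(text):
    """Return the letters in text that WORDCHARS does not contain.

    Hypothetical helper for illustration only; dateutil has no such function.
    """
    return [ch for ch in text if ch.isalpha() and ch not in WORDCHARS]


print(unsupported_letters('8th of Июнь'))  # ['И', 'ю', 'н', 'ь']
print(unsupported_letters('8th of Juni'))  # []
```

Every letter of the Russian month name is rejected, so the lexer can never assemble it into a single word token.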

Another large obstacle is that dateutil uses Python byte strings instead of Unicode strings internally for all of its processing. This means that, even if _timelex were extended to accept Cyrillic letters, there would still be mismatches between the handling of bytes and of characters, and problems caused by differences in string encoding between the caller and the python_dateutil source code.

There are other minor issues, such as an assumption that every month name is at least 3 characters long (not true for Japanese), and many details related to the Gregorian calendar. It would be helpful for the wordchars field to be picked up from parserinfo if present, so that parserinfo could define the right set of characters for its month and day names.
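The wordchars suggestion could look roughly like the sketch below. It is a toy tokenizer, not dateutil's _timelex, and the wordchars attribute on the info object is hypothetical: dateutil's parserinfo defines no such field. The names split_tokens and RussianInfo are likewise invented for the example, which is written for Python 3.

```python
import string

DEFAULT_WORDCHARS = string.ascii_letters + "_"
CYRILLIC_LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


class RussianInfo(object):
    # Hypothetical attribute: the real parserinfo has no `wordchars`.
    wordchars = DEFAULT_WORDCHARS + CYRILLIC_LOWER + CYRILLIC_LOWER.upper()


def split_tokens(text, info=None):
    """Group runs of word characters and runs of digits into tokens.

    The word-character set comes from info.wordchars when present.
    Whitespace is dropped; any other character becomes its own token.
    """
    wordchars = getattr(info, "wordchars", DEFAULT_WORDCHARS)

    def kind(ch):
        if ch in wordchars:
            return "word"
        if ch.isdigit():
            return "num"
        return None  # whitespace or a character the lexer does not know

    tokens = []
    prev = None
    for ch in text:
        k = kind(ch)
        if k is not None and k == prev:
            tokens[-1] += ch   # extend the current run
        elif not ch.isspace():
            tokens.append(ch)  # start a new token
        prev = k
    return tokens


print(split_tokens("8th of Июнь"))                 # ['8', 'th', 'of', 'И', 'ю', 'н', 'ь']
print(split_tokens("8th of Июнь", RussianInfo()))  # ['8', 'th', 'of', 'Июнь']
```

With the default set the month name falls apart into single characters; with the locale-supplied set it survives as one token that a month lookup could then match.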

python_dateutil 2.0 has been ported to Python 3, but the design problems above are not significantly changed. The differences between 2.0 and 1.5 handle Python language changes, not dateutil's design and data structures.

Oleg, you were able to modify parserinfo, and I suspect you succeeded because your test code didn't use the parser() (and _timelex) of python_dateutil. You in essence supplied your own parser and lexer.

Correcting this problem would require fairly major improvements to the text-handling of python_dateutil. It would be great if someone were to make a patch with that change, and the package maintainers were able to incorporate it.

Rik Poggi · Answer 2 · 2012-01-17T18:18:36.497

I took a look at the source code in dateutil/parser.py, and I've found out basically that the string 'Июнь' cannot be a month for dateutil.

The problem starts when your timestr gets split.

At line 349 you have:

l = _timelex.split(timestr)

and since _timelex.split is defined like:

def split(cls, s):      # at line 142
    return list(cls(s))

you get l to be:

['8', 'th', ' ', 'of', ' ', '\x18', '\x04', 'N', '\x04', '=', '\x04', 'L', '\x04']

instead of (more or less) what one would expect it to be:

[u'8th', u'of', u'\u0418\u044e\u043d\u044c']

For this reason the month check returns None, which leads to the exception being raised.

# Check month name
value = info.month(l[i])
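To make the contrast between the two token lists concrete, here is a small Python 3 sketch that is independent of dateutil. It uses a Unicode-aware regular expression to produce the expected tokens, and it checks one observation about the stray tokens: joined together, they equal the UTF-16-LE encoding of the month name. It does not establish why dateutil ended up reading those bytes.

```python
import re

text = "8th of Июнь"

# For str patterns in Python 3, \w also matches Cyrillic letters.
tokens = re.findall(r"\w+", text)
print(tokens)  # ['8th', 'of', 'Июнь']

# The stray tokens from the observed list, joined back together.
observed = b"\x18\x04N\x04=\x04L\x04"
print(observed == "Июнь".encode("utf-16-le"))  # True
```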

Possible workaround:

Translate everything into English and then, if needed, back into Russian.

Example:

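# Map each Russian month name to its English equivalent.
dictionary = {u"Июнь": 'June', u'ноябрь': 'November'}

# `test` is the unicode date string from the question.
for russian, english in dictionary.items():
    test = test.replace(russian, english)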
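The plain replace() loop above is case-sensitive, so 'Июнь' and 'июнь' would need separate entries. A case-insensitive variant might look like the sketch below, written for Python 3. The names RU_TO_EN and translate_months are invented for this example, and the mapping is deliberately incomplete.

```python
import re

# Hypothetical mapping with lowercase keys; extend it with the remaining
# months and any inflected forms (for example genitive "июня") in your input.
RU_TO_EN = {
    "июнь": "June",
    "ноябрь": "November",
}

# Longest names first, so a shorter name never shadows a longer one.
_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(RU_TO_EN, key=len, reverse=True)),
    re.IGNORECASE,
)


def translate_months(text):
    """Replace known Russian month names in text with English ones."""
    return _PATTERN.sub(lambda m: RU_TO_EN[m.group(0).lower()], text)


print(translate_months("8th of Июнь"))  # 8th of June
```

The translated string contains only Latin letters, so it can then be handed to parse() with the default parserinfo.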

Thank you. I understand. But how can I fix it so that I can use a Unicode month name in the parse function? — Oleg Dats, Jan 17 '12 at 17:20
@OlegDats: The easiest workaround that comes to mind is to translate everything into English and, if needed, translate back into Russian at the end. I just updated my answer with a simple example. — Rik Poggi, Jan 17 '12 at 18:20
