
I'm parsing HTML with Python, and the page contains this date string: [ 24-Янв-17 07:24 ]. "Янв" is the Russian abbreviation for "Jan". I want to convert it into a datetime object.

# Some beautifulsoup parsing
timeData = data.find('div', {'id' : 'time'}).text

import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
result = datetime.datetime.strptime(timeData, u'[ %d-%b-%y  %H:%M ]')

The error is:

ValueError: time data '[ 24-\xd0\xaf\xd0\xbd\xd0\xb2-17 07:24 ]' does not match format '[ %d-%b-%y %H:%M ]'

type(timeData) returns unicode. Encoding timeData from utf-8 returns UnicodeEncodeError. What's wrong?


chardet returns {'confidence': 0.87625, 'encoding': 'utf-8'}, and when I write datetime.datetime.strptime(timeData.encode('utf-8'), ...) it raises the same error as above.


The original page uses windows-1251 encoding.

print type(timeData)
print timeData


timeData = timeData.encode('cp1251')
print type(timeData)
print timeData

returns

<type 'unicode'>
[ 24-Янв-17 07:24 ]
<type 'str'>
[ 24-???-17 07:24 ]
Max Frai
  • Possible duplicate of [UnicodeEncodeError when parsing month name with Python strptime](http://stackoverflow.com/questions/11235274/unicodeencodeerror-when-parsing-month-name-with-python-strptime) – cxw Jan 24 '17 at 22:00
  • Or [this one](http://stackoverflow.com/q/2571515/2877364) – cxw Jan 24 '17 at 22:00
  • @cxw No, it didn't help me. – Max Frai Jan 24 '17 at 22:20
  • Possible duplicate of [How do I strftime a date object in a different locale?](http://stackoverflow.com/questions/18593661/how-do-i-strftime-a-date-object-in-a-different-locale) – Paul Rooney Jan 24 '17 at 22:40
  • Maybe this helps? http://stackoverflow.com/questions/11235274/unicodeencodeerror-when-parsing-month-name-with-python-strptime – Dmitry Shilyaev Jan 24 '17 at 22:51

2 Answers


Quick fix

Got it! янв has to be lower-case in CPython 2.7.12. Code (works in CPy 2.7.12 and CPy 3.4.5 on cygwin):

# coding=utf8
#timeData='[ 24-Янв-17 07:24 ]'
timeData='[ 24-янв-17 07:24 ]'    ### lower-case
import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
result = datetime.datetime.strptime(timeData, u'[ %d-%b-%y  %H:%M ]')
print(result)

result:

2017-01-24 07:24:00

If I use the upper-case Янв, it works in Py 3, but in Py 2 it gives

ValueError: time data '[ 24-\xd0\xaf\xd0\xbd\xd0\xb2-17 07:24 ]' does not match format '[ %d-%b-%y  %H:%M ]'

General case

To handle this in general in Python 2, lower-case first (see this answer):

# coding=utf8
timeData=u'[ 24-Янв-17 07:24 ]'
       # ^ unicode data
import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
print(timeData.lower())     # works OK
result = datetime.datetime.strptime(
    timeData.lower().encode('utf8'), u'[ %d-%b-%y  %H:%M ]')
    ##               ^^^^^^^^^^^^^^ back to a string
    ##       ^^^^^^^ lowercase
print(result)

Result:

[ 24-янв-17 07:24 ]
2017-01-24 07:24:00

I can't test it with your beautifulsoup code, but, in general, get Unicode data and then use the above.

Or, if at all possible, switch to Python 3 :) .
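If you are on Python 3, the lower-casing step can be wrapped in a small helper. This is only a sketch: the names lc_time and parse_stamp are made up for illustration, and the demo call uses the always-available 'C' locale because I can't assume ru_RU.UTF-8 is installed where you run it.

```python
import contextlib
import datetime
import locale


@contextlib.contextmanager
def lc_time(name):
    """Temporarily switch LC_TIME to `name`, restoring the old value afterwards."""
    saved = locale.setlocale(locale.LC_TIME)  # query only, changes nothing
    locale.setlocale(locale.LC_TIME, name)    # raises locale.Error if unavailable
    try:
        yield
    finally:
        locale.setlocale(locale.LC_TIME, saved)


def parse_stamp(text, locale_name, fmt='[ %d-%b-%y %H:%M ]'):
    """Lower-case `text`, then parse it with the month names of `locale_name`."""
    with lc_time(locale_name):
        return datetime.datetime.strptime(text.lower(), fmt)


# Portable demo: the 'C' locale exists everywhere.
print(parse_stamp('[ 24-JAN-17 07:24 ]', 'C'))
# For the question's data (needs the ru_RU.UTF-8 locale to be installed):
# parse_stamp('[ 24-Янв-17 07:24 ]', 'ru_RU.UTF-8')
```

Keep in mind that setlocale changes the locale for the whole process, so the helper is not safe to call from several threads at once.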

Explanation

So how did I figure this out? I searched the CPython source for the strptime implementation and found the handy _strptime module, which contains the class LocaleTime. I also found a mention of LocaleTime. To print the available month names, add this to the end of the code under "Quick fix" above:

from _strptime import LocaleTime
lt = LocaleTime()
print(lt.a_month)    

a_month has the abbreviated month names per the source.

On Py3, that yields:

['', 'янв', 'фев', 'мар', 'апр', 'май', 'июн', 'июл', 'авг', 'сен', 'окт', 'ноя', 'дек']
      ^ lowercase!

On Py2, that yields:

['', '\xd1\x8f\xd0\xbd\xd0\xb2',

and a bunch more. Note that the first character is \xd1\x8f (lower-case я), whereas your error message has \xd0\xaf (upper-case Я), so the two don't match.
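The byte mismatch can be reproduced in isolation. The sketch below runs on Python 3 and uses bytes to stand in for Python 2's str. My reading of why Py2 fails (an assumption on my part, not something I traced through the source line by line) is that case-insensitive matching on byte strings only covers ASCII, so the two Cyrillic letters never compare equal.

```python
import re

upper = 'Я'.encode('utf-8')   # b'\xd0\xaf', the bytes in the error message
lower = 'я'.encode('utf-8')   # b'\xd1\x8f', the bytes in a_month on Py2

# Byte-level lower-casing only touches ASCII, so the Cyrillic bytes stay as they are.
print(upper.lower() == upper)

# A case-insensitive match on bytes cannot pair the two letters...
bytes_match = re.match(lower, upper, re.IGNORECASE)
print(bytes_match)

# ...while the same match on text (Unicode) strings succeeds.
text_match = re.match('я', 'Я', re.IGNORECASE)
print(text_match is not None)
```

That is consistent with upper-case Янв working on Py3, where strptime operates on Unicode text.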

cxw
  • @Ockonal Glad to be able to help! I learned some new things, which I always enjoy. :) – cxw Jan 26 '17 at 20:53

You can simply replace the Russian month name with its English equivalent:

import datetime

ru_to_eng_months = {'Янв': 'Jan', }  # fill it with other months

def ru_to_eng_datetime(ru: str) -> str:
    # Split on '-' so the month abbreviation is the middle piece
    s = ru.split('-')
    eng_month = ru_to_eng_months[s[1]]
    return s[0] + '-' + eng_month + '-' + s[2]

s = u'[ 24-Янв-17 07:24 ]'
dateTime = ru_to_eng_datetime(s)
# Needs an English (or C) LC_TIME locale so that %b matches 'Jan'
result = datetime.datetime.strptime(dateTime, u'[ %d-%b-%y  %H:%M ]')
print(result) # 2017-01-24 07:24:00
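A possible variation on the same idea, as a sketch only: map the month to its number instead of to an English name, so the result no longer depends on the active locale at all. The helper name parse_ru_stamp is hypothetical, the abbreviations are the ones a_month reports for ru_RU in the other answer, and the code assumes Python 3 and the exact '[ DD-Mon-YY HH:MM ]' layout from the question.

```python
import datetime

# Abbreviations as reported by LocaleTime().a_month for ru_RU; index = month number.
RU_MONTHS = ['', 'янв', 'фев', 'мар', 'апр', 'май', 'июн',
             'июл', 'авг', 'сен', 'окт', 'ноя', 'дек']


def parse_ru_stamp(text):
    """Parse '[ DD-Mon-YY HH:MM ]' where Mon is a Russian month abbreviation."""
    date_part, time_part = text.strip('[] ').split()
    day, month_name, year = date_part.split('-')
    # Lower-case first so 'Янв' and 'янв' both work; unknown names raise ValueError.
    month = RU_MONTHS.index(month_name.lower())
    # Only numeric fields remain, so no locale setup is needed.
    return datetime.datetime.strptime(
        '%s-%02d-%s %s' % (day, month, year, time_part), '%d-%m-%y %H:%M')


print(parse_ru_stamp('[ 24-Янв-17 07:24 ]'))
```

If the site uses other spellings (for example abbreviations with a trailing dot), the list would need adjusting.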
gasabr