When I use .lower() in Python 2.7, the string is not converted to lowercase for the letters ŠČŽ. I read the data from a dictionary.

I tried both str(tt["code"]).lower() and tt["code"].lower().

Any suggestions?

Yebach
  • Have a look at http://stackoverflow.com/questions/727507/how-can-i-do-unicode-uppercase , I think it is probably related. – mgilson Mar 30 '12 at 12:45

2 Answers

Use unicode strings:

drostie@signy:~$ python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "ŠČŽ"
ŠČŽ
>>> print "ŠČŽ".lower()
ŠČŽ
>>> print u"ŠČŽ".lower()
ščž

See that little u? That means that it's created as a unicode object rather than a str object.
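
If the text arrives as bytes rather than as a literal, the same difference shows up once you decode it. Here is a minimal sketch, assuming the bytes are UTF-8 encoded (the names raw and text are only for illustration, and escapes are used so the snippet does not depend on the source file's encoding):

```python
from __future__ import print_function

# UTF-8 bytes for the three letters in the question (assumed encoding).
raw = b"\xc5\xa0\xc4\x8c\xc5\xbd"

# Decoding turns the byte string into a unicode object.
text = raw.decode("utf-8")

# Byte-level lowercasing that only knows ASCII leaves these bytes untouched.
print(raw.lower() == raw)

# The unicode object lowercases to U+0161, U+010D, U+017E.
print(text.lower() == u"\u0161\u010d\u017e")
```

The only thing that changes between the two calls is the type of the object, not the characters in it.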

CR Drost
  • I am reading from dict so how to convert tt["code"] to u"ŠČŽ"? – Yebach Mar 30 '12 at 13:07
  • Use unicode(tt["code"], 'latin2'), where 'latin2' is the encoding used; you may need a different one. – Tupteq Mar 30 '12 at 13:31
  • Also note that `unicode.lower()` is locale-dependent. It might give different results depending on the environment it runs in. – Sven Marnach Mar 30 '12 at 13:43
  • @SvenMarnach: indeed, it is locale-dependent, but the differences due to locale are minimal, close to the differences due to not using Unicode, since in this case lower and upper will only understand ASCII anyway – jsbueno Mar 30 '12 at 18:53
  • @Yebach: read this piece, it will help you a lot: http://www.joelonsoftware.com/articles/Unicode.html - and after that, use the "decode" string method to convert your strings to unicode – jsbueno Mar 30 '12 at 18:54
  • @Chrisdrost: I think it would be nice if you added to your answer the bit about using the "decode" string method to get unicode out of strings that are not literals. That is the way to go. – jsbueno Mar 30 '12 at 18:56
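
For the dictionary case asked about in the comments, something along these lines might work. It is only a sketch: to_text is a hypothetical helper, the contents of tt are made up, and 'latin2' is an assumed encoding that you would replace with whatever your data really uses.

```python
def to_text(value, encoding="latin2"):
    """Return value as text, decoding it first if it is a byte string."""
    # In Python 2, bytes is an alias for str, so this matches plain str values.
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already a unicode object: nothing to decode.
    return value


# Hypothetical dictionary; the value is the three letters encoded as ISO-8859-2.
tt = {"code": b"\xa9\xc8\xae"}

code = to_text(tt["code"]).lower()
print(code == u"\u0161\u010d\u017e")
```

Checking the type first means the helper does not try to decode a value that is already unicode.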

Use unicode:

>>> print u'ŠČŽ'.lower().encode('utf8')
ščž
>>>

You need to convert your text to unicode as soon as it enters your programme from the outside world, rather than merely at the point at which you notice an issue.

Accordingly, either use the codecs module to read in decoded text, or use 'bytestring'.decode('latin2') (where in place of latin2 you should use whatever the actual encoding is).
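
As a rough sketch of the codecs route, assuming the data comes from a file encoded as latin2 (the temporary file, its name, and its contents are invented here just so the example is self-contained):

```python
from __future__ import print_function

import codecs
import os
import tempfile

# Write a small sample file containing the three letters as ISO-8859-2 bytes.
path = os.path.join(tempfile.mkdtemp(), "codes.txt")
with open(path, "wb") as f:
    f.write(b"\xa9\xc8\xae\n")

# codecs.open decodes while reading, so the program only ever sees unicode.
with codecs.open(path, "r", encoding="latin2") as f:
    code = f.read().strip()

lowered = code.lower()
print(lowered == u"\u0161\u010d\u017e")
```

Because the decoding happens at the point where the file is read, nothing further down has to know which encoding the file used.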

Marcin
  • I am reading from dict so how to convert tt["code"] to u"ŠČŽ"? I can not use ustr(tt["code"]).lower().encode('utf8') or str(tt[u"code"]).lower().encode('utf8') – Yebach Mar 30 '12 at 13:14