So I have a string:

amélie

In bytes it is b'ame\xcc\x81lie'

\xcc\x81 is the UTF-8 encoding of U+0301, the combining acute accent, which modifies the preceding character: http://www.fileformat.info/info/unicode/char/0301/index.htm

As a Unicode string it is:

u'ame\u0301lie'

When I call 'amélie'.title() on that string, I get 'AméLie', which makes no sense to me.

I know I can work around it, but is this intended behavior or a bug? I would expect the "l" NOT to get capitalized.

Another experiment:

  In [1]: [ord(c) for c in 'amélie'.title()]
  Out[1]: [65, 109, 101, 769, 76, 105, 101]

  In [2]: [ord(c) for c in 'amélie']
  Out[2]: [97, 109, 101, 769, 108, 105, 101]
lqdc
  • Interesting. For comparison, I ran it on Python 2.7.x as well:
      Python 2.7.10 (default, Jul 14 2015, 19:46:27) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
      >>> [ord(c) for c in 'amélie']
      [97, 109, 195, 169, 108, 105, 101]
      >>> [ord(c) for c in 'amélie'.title()]
      [65, 109, 195, 169, 76, 105, 101]
    I don't think it's a bug, but it's also not what I would have expected. – Shawn Mehan Sep 02 '15 at 04:42
  • Knocked the version back to your edit. – Shawn Mehan Sep 02 '15 at 04:46
  • Just for your info, there is no UTF-8 involved here. Combining diacritics (I think that's the name for them) are simply part of Unicode and independent of the encoding, just as Python strings are independent of the encoding. Oh, and yes, I think it's a bug, too! – Ulrich Eckhardt Sep 02 '15 at 05:25

1 Answer

Take a look at these questions: Python title() with apostrophes and Titlecasing a string with exceptions

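Basically, this looks like a limitation of the built-in title() method, which is very liberal about what it considers a word boundary: any character that is not a letter, including a combining accent, ends the current word, so the next letter gets capitalized.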
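To make the word-boundary rule concrete, here is a rough model of it. This is only a sketch: naive_title is a hypothetical helper, and treating "cased" as "category Lu, Ll or Lt" is my approximation, not the actual implementation inside Python.

```python
import unicodedata

def naive_title(text):
    # Hypothetical helper: uppercase a character unless the previous
    # character was a cased letter, in which case lowercase it.
    out = []
    prev_cased = False
    for ch in text:
        out.append(ch.lower() if prev_cased else ch.upper())
        # Approximation: only Lu/Ll/Lt letters count as cased.
        prev_cased = unicodedata.category(ch) in ("Lu", "Ll", "Lt")
    return "".join(out)

decomposed = "ame\u0301lie"  # 'e' followed by U+0301

# U+0301 is a nonspacing mark, not a letter, so it resets the "word".
print(unicodedata.category("\u0301"))
print([ord(c) for c in naive_title(decomposed)])
print(naive_title(decomposed) == decomposed.title())
```

Under that model the combining accent behaves just like an apostrophe or a digit would: the letter after it is treated as the start of a new word.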

You can use string.capwords:

import string
string.capwords('amélie')
Out[18]: 'Amélie'
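The reason capwords is not fooled is that it only splits on whitespace. As a sketch of the documented behaviour (capwords_sketch is a hypothetical name, and the second word in the sample string is just made up for illustration):

```python
import string

def capwords_sketch(text):
    # Hypothetical helper mirroring string.capwords with the default
    # separator: split on whitespace, capitalize each word, rejoin.
    return " ".join(word.capitalize() for word in text.split())

# Decomposed é, and two spaces between the words.
name = "ame\u0301lie  poulain"
result = string.capwords(name)

print(result == capwords_sketch(name))
print([hex(ord(c)) for c in result])
```

Two side effects to keep in mind: runs of whitespace are collapsed to a single space, and everything after the first character of each word is lowercased.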

Another option is to use the precomposed character é (U+00E9, which is b'\xc3\xa9' in UTF-8), a single code point with the accent built in:

b'am\xc3\xa9lie'.decode().title()
Out[21]: 'Amélie'
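If the string arrives in decomposed form, you don't have to retype it: normalizing to NFC with the standard unicodedata module should combine the e and the accent into the single character before you call title(). A small sketch, with variable names of my own choosing:

```python
import unicodedata

decomposed = "ame\u0301lie"  # 'e' followed by U+0301
# NFC composes 'e' + U+0301 into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# The composed form is one code point shorter.
print(len(decomposed), len(composed))

titled = composed.title()
print([ord(c) for c in titled])
```

This only helps when a precomposed form exists for the letter and accent combination; otherwise the combining mark stays in the string.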
maxymoo