Given a UTF-8 string like this:

mystring = "işğüı"

is it possible to get its (in-memory) size in bytes with Python (2.5)?

systempuntoout
  • Well, I get 9 when I do `len(mystring)` – NullUserException Oct 01 '10 at 19:41
  • If you convert it to a unicode literal, you get 5: `mystring = u"işğüı"`. Otherwise, it turns into `'i\xc5\x9f\xc4\x9f\xc3\xbc\xc4\xb1'` – aaronasterling Oct 01 '10 at 19:45
  • Which means that slicing such a string may get you illegal characters. Try `mystring[2:6]`. Just putting this out there as I am surprised as well. – Muhammad Alkarouri Oct 01 '10 at 21:47
  • possible duplicate of [How can I determine the byte length of a utf-8 encoded string in Python?](http://stackoverflow.com/questions/6714826/how-can-i-determine-the-byte-length-of-a-utf-8-encoded-string-in-python) – meshy Feb 20 '15 at 23:53

1 Answer

Assuming you mean the number of UTF-8 bytes (and not the extra bytes that Python requires to store the object), you get it the same way as the length of any other string: `len(mystring)`. A plain string literal in Python 2.x is a string of encoded bytes, not Unicode characters.
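To illustrate the difference between the encoded bytes and the memory the object itself takes up, here is a minimal sketch. It assumes Python 2.6 or later, because `sys.getsizeof` and the `b"..."` prefix are not available in 2.5, and the variable names are only illustrative. The exact object size depends on the interpreter and platform.

```python
import sys

# The question's text, written out as the UTF-8 bytes that a
# Python 2 str literal would hold (b"..." is a plain str in 2.6+).
mystring = b"i\xc5\x9f\xc4\x9f\xc3\xbc\xc4\xb1"

# Number of UTF-8 bytes in the string itself.
payload_bytes = len(mystring)

# Size of the whole Python object, including interpreter bookkeeping.
# This value is implementation-dependent.
object_bytes = sys.getsizeof(mystring)

print("payload: %d bytes, object: %d bytes" % (payload_bytes, object_bytes))
```

On CPython the second figure is larger than the first, since it also covers the object header.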

Byte strings:

>>> mystring = "işğüı"
>>> print "length of {0} is {1}".format(repr(mystring), len(mystring))
length of 'i\xc5\x9f\xc4\x9f\xc3\xbc\xc4\xb1' is 9

Unicode strings:

>>> myunicode = u"işğüı"
>>> print "length of {0} is {1}".format(repr(myunicode), len(myunicode))
length of u'i\u015f\u011f\xfc\u0131' is 5

It’s good practice to keep all of your strings in Unicode and only encode them when communicating with the outside world. In this case, you could use `len(myunicode.encode('utf-8'))` to find the size the string would have after encoding.
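As a small sketch of that approach, the helper below (the name `utf8_byte_length` is made up for this example) counts the bytes a Unicode string would occupy once encoded. The text is written with escapes so the result does not depend on the source file's encoding.

```python
def utf8_byte_length(text):
    """Return the number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Same text as u"işğüı", spelled with escapes.
myunicode = u"i\u015f\u011f\xfc\u0131"

char_count = len(myunicode)               # number of characters
byte_count = utf8_byte_length(myunicode)  # number of UTF-8 bytes

print("%d characters, %d bytes" % (char_count, byte_count))
```

Here the character count is 5 and the byte count is 9: `i` takes one byte, and each of the other four characters takes two.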

Josh Lee
  • This answer is wrong. To correctly calculate the number of bytes (octets) in a string, you need to look at the encoded string, since UTF-8 characters range from 1 to 4 bytes. Do `len(bytes(u'計算機', 'utf8')) # returns 9`, not `len(u'計算機') # returns 3`. – Karsten Apr 30 '21 at 17:07