How is the size of a string calculated in Python? I tried the code below:

s = "test"
s.__sizeof__()
53

bytes(s, "utf-8").__sizeof__()
37

bytes(s, "utf-16").__sizeof__()
43

bytes(s, "utf-32").__sizeof__()
53

How does Python calculate the size of a string? Even with UTF-8 encoding, a character takes between 1 and 4 bytes. Assuming the maximum of 4 bytes per character, a string of 4 characters should take around 16 bytes, but __sizeof__ reports sizes ranging from 37 to 53 bytes depending on the encoding chosen.

ForceBru
user2819403
  • Does this answer your question? [Python : Get size of string in bytes](https://stackoverflow.com/questions/30686701/python-get-size-of-string-in-bytes) – deadshot Apr 05 '20 at 13:29
  • This may answer your question: [sizeof(string) not equal to string length](https://stackoverflow.com/a/38749126/5893316) or a more generic Q/A about this topic: [How do I determine the size of an object in Python?](https://stackoverflow.com/q/449560/5893316) – Martin Backasch Apr 05 '20 at 13:38

2 Answers

__sizeof__ calculates the size of the underlying Python object, and these objects are more complicated than the literal bytes that comprise a string.

An empty bytes object is 33 bytes:

>>> b''.__sizeof__()
33

"test" in UTF-8 is exactly 4 bytes wide, so you get:

bytes(s, "utf-8").__sizeof__()
37 == b''.__sizeof__() + 4

For these ASCII characters, UTF-16 and UTF-32 use 2 and 4 bytes per character, and Python's "utf-16" and "utf-32" codecs also prepend a byte order mark (BOM) of 2 and 4 bytes, respectively. So you get 33 + 2 + 2 * 4 = 43 and 33 + 4 + 4 * 4 = 53.
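To check this arithmetic on your own interpreter, here is a minimal sketch. The helper name payload_and_overhead is made up for illustration, and the assumption is a CPython build where a bytes object's __sizeof__() equals a fixed overhead plus its length; the overhead was 33 bytes in the session above but may differ on other builds.

```python
def payload_and_overhead(text, encoding):
    """Split a bytes object's __sizeof__() into encoded payload and fixed overhead.

    Hypothetical helper, not part of any library.
    """
    encoded = text.encode(encoding)             # same result as bytes(text, encoding)
    payload = len(encoded)                      # encoded bytes, including any BOM
    overhead = encoded.__sizeof__() - payload   # what is left is per-object overhead
    return payload, overhead


for encoding in ("utf-8", "utf-16", "utf-32"):
    payload, overhead = payload_and_overhead("test", encoding)
    print(f"{encoding}: payload={payload} bytes, object overhead={overhead} bytes")
```

The payload for "utf-16" and "utf-32" includes the BOM, which is why it comes out larger than 2 or 4 bytes per character.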

ForceBru

If you evaluate the following expressions, you can see the objects whose sizes __sizeof__ is reporting:

>>> s='test'
>>> bytes(s,'utf-8').__sizeof__()
37
>>> bytes(s,'utf-8')
b'test'
>>> bytes(s,'utf-16')
b'\xff\xfet\x00e\x00s\x00t\x00'
>>> bytes(s,'utf-32')
b'\xff\xfe\x00\x00t\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'

The way you wrote your code, __sizeof__ returns the in-memory size of each of these bytes objects:

  • b'test'
  • b'\xff\xfet\x00e\x00s\x00t\x00'
  • b'\xff\xfe\x00\x00t\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'

It does not return the length of the encoded data itself.
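If what you want is the number of bytes in the encoded data, len() on the encoded result is probably closer to what you expect. A small sketch follows; encoded_length is a hypothetical helper name, and the "-le" codec variants are used here only to show the lengths without a BOM.

```python
def encoded_length(text, encoding):
    """Number of bytes the text occupies once encoded (hypothetical helper)."""
    return len(text.encode(encoding))


s = "test"
# "utf-16" and "utf-32" prepend a BOM; the explicit little-endian codecs do not.
for encoding in ("utf-8", "utf-16", "utf-16-le", "utf-32", "utf-32-le"):
    print(encoding, encoded_length(s, encoding))
```

With the BOM excluded, the lengths match the 1, 2, and 4 bytes per character that you would expect for this ASCII string.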

Henrique Branco