Python : Get size of string in bytes

Question

I have a string that is to be sent over a network. I need to check the total number of bytes it takes up.

sys.getsizeof(string_name) returns extra bytes. For example, sys.getsizeof("a") returns 22, while one character is represented in only 1 byte in Python. Is there some other method to find this?

That's because the string "a" is an object in python that contains extra information. — Mmm Donuts, Jun 06 '15 at 19:27
@Some Developer is there a way to get bytes for the string only, without extra information of the complete object? — Iffat Fatima, Jun 06 '15 at 19:29
Does this answer your question? [How can I determine the byte length of a utf-8 encoded string in Python?](https://stackoverflow.com/questions/6714826/how-can-i-determine-the-byte-length-of-a-utf-8-encoded-string-in-python) — maxschlepzig, Nov 01 '20 at 18:42

Mmm Donuts · Accepted Answer · 2015-06-06T19:37:32.130

If you want the number of bytes a string takes up once it is encoded as UTF-8, this function should do it for you pretty solidly.

def utf8len(s):
    return len(s.encode('utf-8'))

The reason you got weird numbers is that sys.getsizeof measures the whole string object in memory, not just its characters. Strings are actual objects in Python, so each one carries bookkeeping data alongside the text.

It's interesting because that bookkeeping (in CPython, things like the reference count, a pointer to the type, the length and a cached hash) needs to be stored somewhere, right? Hence the higher than expected byte count. Methods such as 'encode', which my solution calls on the 's' object, live on the str type rather than inside each string, so they are not part of that number.
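To make the difference concrete, here is a minimal sketch that prints both numbers side by side. The helper `object_overhead` is a hypothetical name introduced only for illustration, and the values reported by `sys.getsizeof` depend on the Python version and platform (CPython is assumed here), so no expected output is shown.

```python
import sys


def utf8len(s):
    """Number of bytes in s once it is encoded as UTF-8."""
    return len(s.encode("utf-8"))


def object_overhead(s):
    """Hypothetical helper: bytes sys.getsizeof reports beyond the UTF-8 payload."""
    return sys.getsizeof(s) - utf8len(s)


for text in ("", "a", "hello"):
    # Columns: the string, its UTF-8 byte count, the in-memory object size, the difference.
    print(repr(text), utf8len(text), sys.getsizeof(text), object_overhead(text))
```

For data going over a network, the encoded length is the number that matters, since only the encoded bytes are transmitted.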

edited Jun 06 '15 at 19:37

answered Jun 06 '15 at 19:28

Mmm Donuts

No worries. Sometimes simple answers make their way into seemingly weird problems haha. – Mmm Donuts Jun 06 '15 at 19:35

The reason for encoding is that, in Python 3, some single-character strings will require multiple bytes to be represented. For instance: `len('你'.encode('utf-8'))`. – Brad Solomon Jul 17 '18 at 03:34
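As a rough illustration of that point, the sketch below compares the character count with the UTF-8 byte count for a few single-character strings. The sample characters are my own choice, written as escapes so the code points are unambiguous.

```python
def utf8len(s):
    """Number of bytes in s once it is encoded as UTF-8."""
    return len(s.encode("utf-8"))


# Each sample is exactly one code point: a, e-acute, a CJK character, an emoji.
samples = ["a", "\u00e9", "\u4f60", "\U0001f600"]

for ch in samples:
    # len(ch) counts characters; utf8len(ch) counts encoded bytes.
    print(ch, len(ch), utf8len(ch))

# The UTF-8 byte counts are 1, 2, 3 and 4 respectively.
```

So `len(s)` on its own only matches the byte count when every character in the string is ASCII.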

score 28 · Answer 2 · answered Mar 16 '19 at 20:54

There's a caveat to the accepted answer.

For some multi-byte encodings (e.g. utf-16), str.encode will add a Byte Order Mark (BOM) at the start, which is a sequence of special bytes that tells the reader which byte order (endianness) was used. So the length you get is actually len(BOM) + len(encoded_word).

If you don't want to count the BOM bytes, you can use either the little-endian version of the encoding (adding the suffix "-le") or the big-endian version (adding the suffix "-be").

>>> len('ciao'.encode('utf-16'))
10
>>> len('ciao'.encode('utf-16-le'))
8
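A small sketch to confirm where the two extra bytes come from, using the BOM constants in the standard `codecs` module. It also shows `utf-8-sig`, which is the UTF-8 variant that writes a BOM; the variable names are only illustrative.

```python
import codecs

word = "ciao"
with_bom = word.encode("utf-16")         # BOM followed by the encoded text
without_bom = word.encode("utf-16-le")   # encoded text only

# The plain "utf-16" codec prepends a BOM in the machine's native byte order.
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(len(with_bom) - len(without_bom))  # 2, the size of the UTF-16 BOM

# Plain "utf-8" never adds a BOM; "utf-8-sig" adds the 3-byte UTF-8 BOM.
print(len(word.encode("utf-8")))         # 4
print(len(word.encode("utf-8-sig")))     # 7
```

Whichever encoding is chosen, the receiver has to decode with a matching one, so it is worth counting bytes with exactly the encoding that will be sent.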
