len() returns wrong number for string

Question

I'm running a simple script on my command line: echo "Alex " > alex.txt

len(open("alex.txt").read()) returns 16 instead of 5

When I run open("alex.txt").read() I get:

ÿþA\x00l\x00e\x00x\x00 \x00\n\x00\n\x00

What is the issue?

You're reading the elements within the text, and it seems that your text is in some Unicode encoding. — Ahmed amin shahin, Oct 13 '21 at 21:39
Yes, looks like UTF-16 with a byte order mark at the beginning. You also have two newline characters, which may have come from editing the file? — Garr Godfrey, Oct 13 '21 at 21:47
Your command line is writing `test.txt` but your code is reading some other file. And if you don't specify a full path in your `open()` call, Python uses what it reckons is the current working directory, which is often not what novices expect. — BoarGules, Oct 13 '21 at 21:49
`open` has an optional `encoding` parameter that you can experiment with — John Coleman, Oct 13 '21 at 21:49
The two newlines were probably `"\r\n"`, but because of the mistaken decoding, the null byte in between them prevented Python from transforming them into a single newline. — Blckknght, Oct 13 '21 at 22:01
Python 3. And I've edited it to Alex.txt, sorry about that. It's not a matter of a different file. — Alex, Oct 14 '21 at 14:18
I've now seen that if I open a text editor, write "Alex " into it, and then read it, the len is indeed 5. So "echo" in the CLI is what is causing the problem, and I don't understand why. — Alex, Oct 14 '21 at 14:34

Garr Godfrey · Answer 1 · 2021-10-13T22:26:55.033

2

The number of bytes in a file and the number of characters in a string are commonly different things.

If you stick to a limited character set such as ASCII, bytes and characters map one to one, but modern programming languages are more sophisticated than that and at least attempt to serve a wider range of written languages, where one character can take several bytes.
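
To illustrate the difference, here is a minimal sketch that encodes the same five-character string two ways. The variable names are only for illustration:

```python
text = "Alex "

ascii_bytes = text.encode("ascii")    # one byte per character
utf16_bytes = text.encode("utf-16")   # 2-byte BOM plus two bytes per character

print(len(text))         # 5 characters
print(len(ascii_bytes))  # 5 bytes
print(len(utf16_bytes))  # 12 bytes
```

The string has the same length in both cases; only the byte count changes with the encoding.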

You generally need to know what the encoding is. You may not get any indication in the file itself.

After reading the bytes, you need to decode those bytes into a string:

open("alex.txt","rb").read().decode('utf-16')

You can also have open do this for you, which is likely more reliable:

open("alex.txt",encoding='utf-16').read()
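
As a rough, self-contained check, the sketch below recreates a file like the one in the question and reads it back with the encoding specified. The exact bytes are an assumption: they follow the output shown in the question, with the line ending taken to be "\r\n" as suggested in the comments. The Latin-1 decode is likewise only a stand-in for whatever default encoding produced the ÿþ output.

```python
import os
import tempfile

# Assumed file contents: UTF-16 LE BOM, "Alex ", then "\r\n" (a guess based on the comments).
raw = b"\xff\xfeA\x00l\x00e\x00x\x00 \x00\r\x00\n\x00"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "alex.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Text mode with the right encoding consumes the BOM and turns "\r\n" into "\n".
    with open(path, encoding="utf-16") as f:
        text = f.read()

# A single-byte encoding maps every byte to one character, hence the 16.
wrong_len = len(raw.decode("latin-1"))

print(wrong_len)               # 16
print(repr(text), len(text))   # 'Alex \n' 6
print(len(text.rstrip("\n")))  # 5
```

Note that the length is 6 rather than 5 until the trailing newline written by echo is stripped.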

Now, if you want to be fancy and get the encoding from the BOM, you can look at the answers here:

Reading Unicode file data with BOM chars in Python
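
For a rough idea of what BOM detection can look like, here is a minimal sketch using the BOM constants from the standard codecs module. It is not necessarily the approach taken in the linked answers; the helper name sniff_encoding and the "utf-8" fallback are assumptions made for this example.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(raw, default="utf-8"):
    """Hypothetical helper: guess a codec name from a leading BOM, else return default."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default

raw = b"\xff\xfeA\x00l\x00e\x00x\x00 \x00"
encoding = sniff_encoding(raw)
decoded = raw.decode(encoding)  # these codecs consume the BOM while decoding

print(encoding, repr(decoded))
```

This only helps when a BOM is present. Without one the function falls back to the default, which may well be wrong, and a UTF-16 LE file whose first character is NUL would be mistaken for UTF-32.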

edited Oct 13 '21 at 22:26

answered Oct 13 '21 at 21:55

Garr Godfrey


You can't decode from a Unicode string to another Unicode string. That code would be valid in Python 2, but not in Python 3. You should put the encoding in the call to `open`, and let the Python IO machinery take care of it for you. – Blckknght Oct 13 '21 at 21:59
You are right, I meant to make it open as binary – Garr Godfrey Oct 13 '21 at 22:26
