LPTHW Exercise 17. Why is the output from len() not what the exercise says?

Question

I wrote the following Python 3 script:

from sys import argv
from os.path import exists

script, from_file, to_file = argv

print(f"Copying from {from_file} to {to_file}")

in_file = open(from_file)
indata = in_file.read()

print(f"The input file is {len(indata)} bytes long")

print(f"Does the output file exist? {exists(to_file)}")
print("Ready, hit RETURN to continue, CTRL-C to abort.")
input()

out_file = open(to_file, 'w')
out_file.write(indata)

print("Alright, all done.")

out_file.close()
in_file.close()

Apparently the output of len(indata) should be:

The input file is 21 bytes long

But I get:

The input file is 46 bytes long

The from_file is a file called test.txt which contains the text "This is a test file."

I double-checked the text inside test.txt. I thought that the difference may be on the computer since I'm using Windows and the teacher doesn't.

Expected output of the exercise according to Zed

This is my first post here and I already tried to find something about this issue. Although I found some questions about exercise 17, I found nothing about the bytes difference.

Your file may contain blank spaces. Oh, and yes: which Python version are you using? — bruno desthuilliers, Dec 18 '17 at 14:35
Both Windows and my Ubuntu VM confirm that this should be *20* bytes long. Possibly errata within the book. — Mangohero1, Dec 18 '17 at 14:40
Oops, thanks, @brunodesthuilliers, super newbie mistake. Python 3, I will edit the question. — Naiara, Dec 18 '17 at 14:54
I see, @Mangohero1. Your output is closer to the "correct one". Thanks for checking it. — Naiara, Dec 18 '17 at 14:55
@Mangohero1 There's probably a newline character at the end. — trent, Dec 18 '17 at 15:03
Ok, I get the very same numbers (21) on both python 2.7.6 and 3.4.3 on ubuntu so the difference doesn't come from python3 (or at least not _only_ from python3) — bruno desthuilliers, Dec 18 '17 at 15:03
@Naiara: can you try stripping the content and posting the result ? (replace `indata = in_file.read()` with `indata = in_file.read().strip()`) — bruno desthuilliers, Dec 18 '17 at 15:05
@brunodesthuilliers: why strip the result? That'll just hide what might be relevant. I'd just add `print(repr(indata))` instead. — DSM, Dec 18 '17 at 15:05
@DSM yes that's a solution too, and probably a better one actually ;) The point of stripping was to know (without visual inspection) if the diff comes from whitespaces or something else. — bruno desthuilliers, Dec 18 '17 at 15:10
@brunodesthuilliers and @DSM I used `print(repr(indata))` as you suggested and I got this: `'ÿþT\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00 \x00f\x00i\x00l\x00e\x00.\x00'` Which is something I have never seen before. — Naiara, Dec 18 '17 at 15:21
This seriously looks like the utf-16 encoded representation of "This is a test file" - except for the BOM (should be `"\xff\xfe"`, not "ÿþ"). You have to understand that strings/byte strings/unicode strings etc are totally different beasts in Python2 and Python3. How did you generate your test file ? — bruno desthuilliers, Dec 18 '17 at 15:34
@brunodesthuilliers I wrote `PS C:\Users\naiara\lpthw\ex17> echo "This is a test file." > test.txt` on Power Shell. — Naiara, Dec 18 '17 at 16:01
@brunodesthuilliers: and now we see the extra information came in handy. ;-) — DSM, Dec 18 '17 at 16:19

trent · Accepted Answer · 2017-12-19T12:22:57.803

Short version

You get this output because the file is encoded as UTF-16, probably because the tool you used to create it writes UTF-16 by default on Windows. You didn't specify an encoding when reading it, so Python fell back to the platform's default encoding and decoded the bytes incorrectly. To avoid this kind of issue, always pass an encoding argument to the open function, whether reading or writing:

in_file = open(from_file, encoding='utf-16')
# ...
out_file = open(to_file, 'w', encoding='utf-16')
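
As a minimal sketch of that advice, the copy step could be written with an explicit encoding on both sides. The helper name copy_text, the temporary file names, and the choice of UTF-8 for the output are assumptions made for illustration, not part of the exercise:

```python
import os
import tempfile


def copy_text(src, dst, src_encoding, dst_encoding):
    """Copy a text file with explicit encodings (hypothetical helper)."""
    # newline="" disables newline translation, so line endings are kept as-is.
    with open(src, encoding=src_encoding, newline="") as in_file:
        data = in_file.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as out_file:
        out_file.write(data)
    return len(data)


# Demo on a throwaway UTF-16 file similar to the one in the question.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.txt")
    dst = os.path.join(tmp, "copy.txt")
    with open(src, "w", encoding="utf-16", newline="") as f:
        f.write("This is a test file.\r\n")
    n_chars = copy_text(src, dst, src_encoding="utf-16", dst_encoding="utf-8")
    n_bytes = os.path.getsize(dst)

print(n_chars, n_bytes)  # 22 22
```

Using with blocks also closes both files automatically, even if reading or writing raises an exception.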

Long version

21 is the number of bytes in the file when encoded as UTF-8 with a terminating LF character ('\n'), without a byte order mark (BOM).

46 is the number of bytes in the file when encoded as UTF-16 with a terminating CR+LF combination ('\r\n') and a BOM: 2 bytes for the BOM plus 2 bytes for each of the 22 characters.
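
Both byte counts can be checked with a quick sketch that encodes the string in memory instead of reading a file. The variable names are made up for this example:

```python
text = "This is a test file."

# UTF-8, LF line ending, no BOM: one byte per ASCII character.
utf8_lf = (text + "\n").encode("utf-8")

# UTF-16 with a BOM and a CR+LF line ending: 2 bytes for the BOM,
# then 2 bytes for each of the 22 characters.
utf16_crlf = (text + "\r\n").encode("utf-16")

print(len(utf8_lf))     # 21
print(len(utf16_crlf))  # 46
```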

Much as we'd like to think text is "just text", it has to be encoded somehow into bytes (see this Q&A for more information). On Linux, the most widely followed convention is to use UTF-8 for everything. On Windows, UTF-16 is more common, but you also get other encodings.

Python's open function has an encoding argument that you can use to tell Python that the file you're opening is UTF-16. The text is then decoded correctly, and len(indata) counts characters instead of raw bytes:

in_file = open(from_file, encoding='utf-16')

What's it doing instead? Well, the open function is documented to use locale.getpreferredencoding(False) if you don't specify an encoding, so you can find out by typing import locale; locale.getpreferredencoding(False). But I can save you the effort by telling you that on a US English or Western European Windows installation the preferred encoding is typically Windows-1252 (cp1252). And if you take the string "This is a test file.", encode it into UTF-16, and decode it as Windows-1252, you'll see the unusual string you discovered:

>>> line = "This is a test file."
>>> line_bytes = line.encode('utf-16')
>>> line_bytes.decode('windows-1252')
'ÿþT\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00 \x00f\x00i\x00l\x00e\x00.\x00'

The ÿþ is how Windows-1252 treats the BOM. There's still something not quite right, since len(line_bytes) is only 42, not 46, so I have to assume something else is going on with the line endings; if you add \r\n to the original string you do get a 46-character string.
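
The 42 versus 46 discrepancy can be reproduced with a small sketch. It assumes the file is little-endian UTF-16 with a BOM, which is what the ÿþ prefix suggests, and builds the bytes explicitly so the result does not depend on the machine's byte order:

```python
import codecs

text = "This is a test file."

# Little-endian BOM followed by UTF-16-LE text, with and without CR+LF.
without_newline = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
with_crlf = codecs.BOM_UTF16_LE + (text + "\r\n").encode("utf-16-le")

# Windows-1252 maps each of these bytes to exactly one character,
# so the decoded string is as long as the byte string.
print(len(without_newline.decode("cp1252")))  # 42
print(len(with_crlf.decode("cp1252")))        # 46

# Decoding with the right codec consumes the BOM and recovers the text.
print(repr(with_crlf.decode("utf-16")))  # 'This is a test file.\r\n'
```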

Note that even on Linux, Zed's output is misleading: len(indata) counts Unicode code points, so the input is 21 code points long, not 21 bytes. It happens to also be 21 bytes only because UTF-8 is the preferred encoding on Linux and every character in the file is in the ASCII subset, which UTF-8 encodes as one byte per character.
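
A short sketch of the difference between code points and bytes. The accented string is an arbitrary example chosen for illustration and is not from the exercise:

```python
ascii_text = "This is a test file.\n"
accented = "caf\u00e9\n"  # "café" plus a newline

# len() on a str counts code points; len() on bytes counts bytes.
print(len(ascii_text), len(ascii_text.encode("utf-8")))  # 21 21
print(len(accented), len(accented.encode("utf-8")))      # 5 6
```

The two numbers agree only for the pure-ASCII string; the é takes two bytes in UTF-8.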

Really interesting, @trentcl. I repeated the full exercise and I got a different result, 42 bytes. This time, before I tried the script, I opened test.txt with Windows Notepad and pressed "Supr" (Delete) just after the dot to be sure there wasn't an extra line or something I couldn't see. Since I got a different result, something changed... By the way, thanks for your answer. I upvoted it; however, I don't yet have enough reputation to upvote "publicly". — Naiara, Dec 18 '17 at 16:09
@Naiara No problem, and welcome to Stack Overflow! You can accept an answer by clicking the check mark next to it, but I suggest you wait a day or so before doing so because answered questions are less likely to get attention, including upvotes, corrections and additional answers. — trent, Dec 18 '17 at 16:23
@Naiara By the way, you can edit the file in Notepad and when you click "Save as..." there is an Encoding option that you can use to save in UTF-8. Most editors support this feature one way or another. I don't know how you would do that in PowerShell or if it's even possible. — trent, Dec 18 '17 at 16:28
I saved the file in UTF-8. The result now is `The input file is 23 bytes long` which is really close to the original. I guess my question was a little basic or pointless, I mean, 23 bytes or 42 bytes in such a simple file... But thanks to all your answers guys I learned something new about the Encoding format and that's great. — Naiara, Dec 19 '17 at 11:45
@Naiara That could mean it's encoded as UTF-8 without a line terminator but with a byte-order mark (the BOM in UTF-8 is 3 bytes, see also [this answer](https://stackoverflow.com/questions/6769311/how-windows-notepad-interpret-characters#6769431)), but you are still decoding it as Windows-1252. If you `print(repr(indata))` now you will probably get something like `'ï»¿This is a test file.'` You *have* to specify the exact encoding to correctly read from (or write to) a text file. — trent, Dec 19 '17 at 12:16
[Apparently you can also use the `utf-8-sig` encoding to ignore the BOM.](https://stackoverflow.com/a/44573867/3650362) — trent, Dec 19 '17 at 12:21
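
The 23-byte result and the `utf-8-sig` suggestion from the comments above can be checked with a sketch like the following. It assumes the saved file is a UTF-8 BOM followed by the text with no line terminator, as trent guesses:

```python
import codecs

# UTF-8 BOM (3 bytes) followed by the 20 ASCII characters.
raw = codecs.BOM_UTF8 + "This is a test file.".encode("utf-8")

print(len(raw))                       # 23
print(repr(raw.decode("cp1252")))     # 'ï»¿This is a test file.'
print(len(raw.decode("utf-8")))       # 21, the BOM survives as U+FEFF
print(repr(raw.decode("utf-8-sig")))  # 'This is a test file.'
```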
