Short version
You get this output because the file is encoded as UTF-16, probably because the editor you saved it with defaults to that on Windows. You didn't specify an encoding when reading it, so Python guessed wrong. To avoid this kind of issue, always pass an encoding argument to the open function, whether reading or writing:
in_file = open(from_file, encoding='utf-16')
# ...
out_file = open(to_file, 'w', encoding='utf-16')
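If you want to convince yourself that an explicit encoding round-trips cleanly, here is a minimal sketch. The scratch path and the variable names are made up for the demonstration and are not part of the original script; I've also used with blocks so the files get closed.

```python
import os
import tempfile

text = "This is a test file.\n"

# Hypothetical scratch location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Write with an explicit encoding...
with open(path, "w", encoding="utf-16") as out_file:
    out_file.write(text)

# ...and read back with the same one. The BOM is consumed by the codec.
with open(path, encoding="utf-16") as in_file:
    round_tripped = in_file.read()

print(round_tripped == text)  # True
```

The point is only that the reader and the writer agree on the encoding; which encoding you pick matters less than naming it.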
Long version
21 is the number of bytes in the file when encoded as UTF-8 with a terminating LF character ('\n'), without a byte order mark (BOM).
46 is the number of bytes in the file when encoded as UTF-16 with a terminating CR+LF combination ('\r\n') and a BOM.
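Both numbers can be checked by encoding the text in memory. This is just a sketch of the arithmetic, assuming the file contains the single line from the question; the variable names are mine.

```python
line = "This is a test file."  # 20 characters

# UTF-8, LF line ending, no BOM: 20 + 1 bytes.
utf8_lf = (line + "\n").encode("utf-8")

# UTF-16, CR+LF line ending; the 'utf-16' codec prepends a 2-byte BOM,
# and each of the 22 characters takes 2 bytes: 2 + 22 * 2.
utf16_crlf = (line + "\r\n").encode("utf-16")

print(len(utf8_lf))     # 21
print(len(utf16_crlf))  # 46
```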
Much as we'd like to think text is "just text", it has to be encoded somehow into bytes (see this Q&A for more information). On Linux, the most widely followed convention is to use UTF-8 for everything. On Windows, UTF-16 is more common, but you also get other encodings.
Python's open function has an encoding argument that you can use to tell Python that the file you're opening is UTF-16, and then you'll get a different result:
in_file = open(from_file, encoding='utf-16')
What's it doing instead? Well, the open function is documented to use locale.getpreferredencoding(False) if you don't specify an encoding, so you can find out by typing import locale; locale.getpreferredencoding(False). But I can save you the effort by telling you that the preferred encoding on Windows is typically Windows-1252 (at least on US and Western European systems). And if you take the string "This is a test file.", encode it into UTF-16, and decode it as Windows-1252, you'll see the unusual string you discovered:
>>> line = "This is a test file."
>>> line_bytes = line.encode('utf-16')
>>> line_bytes.decode('windows-1252')
'ÿþT\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00 \x00f\x00i\x00l\x00e\x00.\x00'
The ÿþ is how Windows-1252 treats the BOM. There's still something not quite right, since len(line_bytes) is only 42, not 46, so I have to assume something else is going on with the line endings; if you add \r\n to the original string you do get a 46-character string.
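Here is that last check spelled out. It is a sketch under one assumption that the question doesn't confirm: that the editor saved the file as little-endian UTF-16 with a BOM, which I build explicitly with codecs.BOM_UTF16_LE so the result doesn't depend on the machine's byte order.

```python
import codecs

line = "This is a test file.\r\n"  # 22 characters including CR+LF

# Assumed on-disk bytes: little-endian BOM followed by UTF-16-LE text.
file_bytes = codecs.BOM_UTF16_LE + line.encode("utf-16-le")

wrong = file_bytes.decode("windows-1252")  # what a wrong guess produces
right = file_bytes.decode("utf-16")        # BOM is detected and stripped

print(len(file_bytes))                  # 46
print(len(wrong))                       # 46
print(wrong.startswith("\u00ff\u00fe")) # True: the BOM shows up as 'ÿþ'
print(right == line)                    # True
```

Windows-1252 maps every byte to exactly one character, which is why the mis-decoded string has as many characters as the file has bytes.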
Note that even on Linux, Zed's output is misleading: the number it reports is the length of the input in Unicode code points, not in bytes. The two happen to agree (21) only because every character in the file is in the ASCII subset, which UTF-8 (the preferred encoding on Linux) encodes as one byte per character.
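To see the two counts diverge, swap in a couple of non-ASCII characters. The accented variant below is my own illustration, not something from the exercise.

```python
ascii_text = "This is a test file.\n"
# Same sentence with two accented letters (i-diaeresis and e-acute).
accented_text = "Th\u00efs is a t\u00e9st file.\n"

for text in (ascii_text, accented_text):
    code_points = len(text)              # what len() on a str reports
    byte_count = len(text.encode("utf-8"))  # what ends up on disk
    print(code_points, byte_count)
# 21 21
# 21 23
```

Each accented letter takes two bytes in UTF-8, so the second string is still 21 code points long but occupies 23 bytes.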