
I was experimenting with simple code to see how read() behaves on text files. So I made a simple txt file with the following:

AB

BA

I tried to print the first 2 characters to the console.

With the encoding set to "ansi" for both the txt file and open(), the output is correct.

With the encoding set to "utf-8" for both the txt file and open(), the output is A.

With the txt file saved as "utf-8" and open() left at its default encoding, the output is ο».

What is going on? locale.getpreferredencoding() returns cp1253. Could it be that the ο» characters are interfering with my UTF-8 encoding? How can I get rid of them?

My code:

import os

current_dir = "some_directory"  # doesn't really matter
file_name = "name_of_text.txt"
# os.path.join inserts the path separator that plain concatenation omits
full_path = os.path.join(current_dir, file_name)
file_mode = "rt"

# add encoding="utf_8" or encoding="ansi" to open() to replicate
with open(full_path, mode=file_mode) as f:
    reader = f.read(2)

print(reader)
snakecharmerb
Demis
  • Look at [the documentation](https://docs.python.org/3/tutorial/inputoutput.html). The optional number passed to `read()` is the size in bytes, not the number of characters. Your question seems predicated on it returning 2 characters rather than 2 bytes. Different encodings use different numbers of bytes. – John Coleman Apr 22 '19 at 18:48
  • @JohnColeman Yes, I read that, but how can I possibly know the character size for every encoding? Later I searched elsewhere and found this page, which talks about the number of characters: https://www.w3schools.com/python/python_file_open.asp so it seems I can't trust w3schools ... – Demis Apr 22 '19 at 19:01
  • This might help: https://stackoverflow.com/q/2988211/4996248, but even simpler would be just `f.read()[:2]` (unless the file is so large that memory is an issue). – John Coleman Apr 22 '19 at 19:08

1 Answer

The files have been encoded with the utf-8-sig codec, which some Microsoft applications use when UTF-8 encoding is required. This codec writes a three-byte marker, the byte order mark (BOM), at the beginning of the file (described in this section of the codecs docs).
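To make the marker concrete, here is a minimal sketch that prints the BOM bytes and shows them at the front of an encoded string. The sample text "AB" is taken from the question; everything else is only for illustration.

```python
import codecs

# The UTF-8 byte order mark (BOM) is a fixed three-byte sequence.
bom = codecs.BOM_UTF8

# Encoding with utf-8-sig prepends those three bytes to the encoded text.
encoded = "AB".encode("utf-8-sig")

print(bom)
print(encoded)
print(encoded.startswith(bom))
```

Plain "utf-8" encoding of the same string would produce only the two bytes for A and B.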

When you decode with UTF-8, the three marker bytes are read as a single, invisible character (UTF-8 characters may be composed of more than one byte), so the first two characters are that marker followed by 'A', and you only see 'A'.
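The difference can be checked without a file. This sketch assumes the file's content is the "AB" / "BA" text from the question and compares the first two characters under each codec; the variable names are illustrative.

```python
# Bytes as a BOM-writing editor would save them (assumed content: "AB\nBA").
raw = "AB\nBA".encode("utf-8-sig")

# Plain UTF-8 keeps the BOM as the character U+FEFF at the start of the text.
first_two_utf8 = raw.decode("utf-8")[:2]

# utf-8-sig strips a leading BOM while decoding.
first_two_sig = raw.decode("utf-8-sig")[:2]

print(repr(first_two_utf8))
print(repr(first_two_sig))
```

Using repr() makes the otherwise invisible U+FEFF character show up as an escape sequence.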

When you decode with no encoding specified, cp1253 is used. It is a single-byte encoding, so it treats each marker byte as an ordinary character, hence the output that you see:

>>> 'AB'.encode('utf-8-sig').decode('cp1253')[:2]
'ο»'
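One way to get rid of the marker when reading is to open the file with encoding="utf-8-sig". The sketch below writes a temporary file as a stand-in for the real one (the temporary file and its content are assumptions for the demo), then reads the first two characters.

```python
import os
import tempfile

# Create a throwaway file that starts with a BOM, standing in for the real file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", suffix=".txt", delete=False
) as tmp:
    tmp.write("AB\nBA")
    path = tmp.name

try:
    # utf-8-sig skips a leading BOM if present, so read(2) yields real text.
    with open(path, mode="rt", encoding="utf-8-sig") as f:
        first_two = f.read(2)
finally:
    os.remove(path)

print(first_two)
```

In text mode the argument to read() counts decoded characters rather than bytes, which is why the BOM occupies one of the two slots when the file is opened as plain "utf-8".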
snakecharmerb
  • Interesting that using encoding = 'utf-8-sig' returns two characters; I thought UTF used more than 1 byte per character. Do you know which encoding I should use for Unicode? "Unicode" is not a valid encoding input. – Demis Apr 22 '19 at 19:20
  • @Demis I would recommend always using UTF-8 where possible. utf-8-sig is ok if you are coding for yourself on a Windows machine - it won't complain if given standard UTF-8 text to decode - but if you are sharing data with others I would stick to standard UTF-8. – snakecharmerb Apr 22 '19 at 19:35