I know that the first line of a Python file can declare the file's encoding.
But here's what I don't understand: even though that first line is itself encoded in some specific encoding, how does the editor know the correct encoding to read the first line with?
Thanks for your reply.
This is mostly a ramble, because codec handling in Python is a bit of a ramble.
First, the encoding line ties into the standard library's codecs machinery. It's an odd adapter pattern: getregentry() and register() functions instead of metadata. See cpython/Python/codecs.c (the CPython source), which will be more accurate than the documentation.
More specifically, the encoding line is defined by PEP 263. Because its characters are all low (plain ASCII), it reads the same under encodings like UTF-8, ISO-8859-1, and others. It's a bit like the old Hayes modem command "AT": two letters that happened to work regardless of parity and byte-size settings. The most common other family is UTF-16 and its variants, which carry a BOM.
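To make the PEP 263 mechanism concrete, here is a minimal sketch of the lookup it specifies: the magic comment must appear on line 1 or 2 of the file and match a regular expression given in the PEP itself (the function name here is my own, not CPython's):

```python
import re

# Pattern for the coding declaration, as given in PEP 263.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def find_coding_declaration(source_bytes):
    """Return the declared encoding name, or None if there is none."""
    # Only the first two lines may carry the declaration.
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            # The name itself is guaranteed to be ASCII-safe.
            return match.group(1).decode("ascii")
    return None

print(find_coding_declaration(b"# -*- coding: latin-1 -*-\nx = 1\n"))  # latin-1
print(find_coding_declaration(b"x = 1\n"))  # None
```

Because every byte of that comment is below 128, the same byte sequence is readable whether the file turns out to be UTF-8, Latin-1, or most other ASCII-compatible encodings, which is what resolves the chicken-and-egg problem in the question.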
You might also look at cpython/Parser/tokenizer.c:check_coding_spec(), cpython/Parser/pegen.c:1172 calling PyTokenizer_FromFile(), and others. It's a bit of a rabbit hole, and you will understand too much of Python's tokenizer before you are done.
The short answer: Python opens the file as bytes; by the time the source leaves the tokenizer it is UTF-8. The tokenizer checks for a BOM (Byte Order Mark), does some magic with the codec machinery to read the encoding line, and then uses the declared encoding. It's messy, but it works in enough variants that people are satisfied.
I hope this answers your question.
Each editor has its own built-in heuristics, based on the bytes in the file and sometimes the file extension, to determine the encoding. If the editor cannot determine the encoding, it falls back to a common default, usually UTF-8
for text files and the like, since UTF-8 covers a large character repertoire and is widely used.
Take, for example, Python itself. During the Python 2 era, the default/fallback encoding for source code was ASCII, so the first few lines where you declare your encoding had to be valid ASCII for Python 2 to process them. In Python 3, this was switched to UTF-8: the interpreter reads the first few lines as valid UTF-8 and then overrides that with whatever custom encoding you provide.
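You can watch Python 3 do exactly this override via the standard library: tokenize.detect_encoding() is the interpreter-side logic exposed as a function. It checks for a BOM, then the coding comment, and otherwise returns the UTF-8 default:

```python
import io
import tokenize

# A file that declares its encoding, and one that relies on the default.
declared = b"# coding: utf-8\nname = 'caf\xc3\xa9'\n"
plain = b"x = 1\n"

# detect_encoding() takes a readline callable over the raw bytes and
# returns (encoding, list_of_lines_it_consumed).
enc1, _ = tokenize.detect_encoding(io.BytesIO(declared).readline)
enc2, _ = tokenize.detect_encoding(io.BytesIO(plain).readline)
print(enc1)  # utf-8 (from the coding comment)
print(enc2)  # utf-8 (the Python 3 default, no declaration present)
```

Note that detect_encoding() normalizes names, so a declaration like iso-8859-1 comes back as latin-1.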
I don't believe there is any foolproof way of knowing a file's encoding other than guessing one and then trying to decode with it.
The editor might assume, for example, UTF-8, a very common encoding capable of encoding any Unicode character. If the file decodes without errors, there is nothing else to do. Otherwise, the editor presumably has a strategy of trying certain other encodings until one succeeds without a decoding error, or it finally gives up. An editor that understands content might additionally check, even when the file decodes cleanly, that the result looks like what the file type implies.
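That trial-and-error strategy is a few lines of code. This is an illustrative sketch (the function name and candidate list are my own, not any editor's actual algorithm):

```python
def sniff_encoding(raw, candidates=("utf-8", "utf-16", "latin-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue  # this guess failed; try the next one
    return None

print(sniff_encoding("héllo".encode("utf-8")))    # utf-8
print(sniff_encoding("héllo".encode("latin-1")))  # latin-1 (utf-8 and utf-16 both fail)
```

One caveat worth knowing: latin-1 maps every possible byte to a character, so with it as the last candidate you always get *an* answer, just not necessarily the right one. That is why the content check mentioned above matters.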
I'm not sure I understood your question. However, all IDEs have a default encoding, which for Python IDEs is UTF-8. The editor first checks whether each byte value is smaller than 128 or not; from that it can tell whether one or more bytes are being used per character (and therefore whether the file is UTF-8, UTF-16, or so on).
Another reason the default encoding is UTF-8 is that UTF-8 can handle any Unicode code point.
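A quick demonstration of that point: UTF-8 spends one to four bytes per code point, which is enough to cover every character Unicode defines, from ASCII up through emoji:

```python
# One sample character from each UTF-8 length class.
for ch in ["A", "é", "€", "🐍"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded!r}")
# A is 1 byte, é is 2, € is 3, and the snake emoji is 4.
```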
You can find more info here: https://docs.python.org/3/howto/unicode.html