1

I know that the first line of a Python file can declare the file's encoding.

But since the words of that first line are themselves encoded in some specific encoding, how does the editor know the correct encoding to read the first line's words?

Thanks for your reply.

Dongr
  • A file is just a bunch of bytes. The interpreter checks whether the first few bytes contain the encoding info. See [utf-8 for Python 3.x](https://stackoverflow.com/questions/25205395/what-is-the-default-encoding-method-for-code-assumed-by-python-interpreter) – heLomaN Sep 27 '20 at 09:35
  • @heLomaN I think OP's question is how the interpreter decodes the first few bytes for the encoding info without knowing the encoding for those first few bytes. – timgeb Sep 27 '20 at 11:28
  • what do you mean (exactly) by "editor"? – pygri Oct 30 '20 at 16:02
  • Very strongly related: [What's the difference between 'coding=utf8' and '-*- coding: utf-8 -*-'?](https://stackoverflow.com/q/20301920) – Martijn Pieters Nov 04 '20 at 08:24
  • Basically: how an editor interprets those lines is up to each individual editor. With most editors _these days_ defaulting to UTF-8, it is easier to just ignore the whole issue, but the PEP 263 format comment standard is specifically designed to support whatever your editor might support. – Martijn Pieters Nov 04 '20 at 08:25
  • "how does editor know the correct encoding of the first line words." The short version is: **it doesn't need to, because** every encoding that may legally be used, will use the **same bytes** for that coding declaration. I contributed an answer to the canonical that goes over this. – Karl Knechtel Mar 16 '23 at 08:48

4 Answers

2

This is mostly a ramble, because codec handling in Python is a bit of a ramble itself.

First, the encoding line is handled through the standard Python library codec machinery. It's an odd adapter pattern:

  • Odd complications around recognizing various codecs named 'utf-*'
  • The idea of 'Stream' versus 'Incremental' versus basic encoders/decoders
  • Explicit getregentry() and register() functions instead of using metadata.
  • Poor documentation, and lots of implementation-specific tricks.
  • You can start by looking at cpython/Python/codecs.c (the CPython source), which will have more accuracy than the documentation.
  • This is an area where you might find incompatibilities between CPython, Jython, PyPy, and other implementations.
  • Here there be dragons

More specifically, the encoding line is defined by PEP 263. Because its characters are all in the low ASCII range, it reads the same under encodings like UTF-8, ISO-8859-1, and others. It's a bit like the old Hayes modem command "AT": two letters that happened to work regardless of parity and byte-size settings. The most common other encodings are UTF-16 and its variants, which carry a BOM.
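To make that concrete, here is a rough sketch of the PEP 263 lookup. The regular expression is a simplified form of the one in the PEP; the function name, line handling, and fallback are my own simplifications, not the real tokenizer's code:

```python
import re

# Simplified form of the regular expression PEP 263 gives for the declaration.
CODING_RE = re.compile(rb"coding[=:]\s*([-\w.]+)")

def detect_source_encoding(raw: bytes) -> str:
    """Sketch of PEP 263 detection; simplified from the real tokenizer."""
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8"  # a UTF-8 BOM wins over any comment
    # Only the first two lines may carry a declaration, and only in comments.
    for line in raw.split(b"\n", 2)[:2]:
        stripped = line.strip()
        if stripped and not stripped.startswith(b"#"):
            break  # real code reached; no declaration can follow
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return "utf-8"  # Python 3's default when nothing is declared
```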

You might also look at cpython/Parser/tokenizer.c:check_coding_spec(), cpython/Parser/pegen.c:1172 calling PyTokenizer_FromFile(), and others. It's a bit of a rabbit hole, and you will understand too much of Python's tokenizer before you are done.

The short answer: Python opens the file as raw bytes. The tokenizer checks for a BOM (Byte Order Mark), does some magic with the codec machinery to read the encoding line, and then decodes the rest of the source with the declared encoding, so everything is UTF-8 by the time it leaves the tokenizer. It's messy, but it works in enough variants that people are satisfied.
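The standard library exposes exactly this dance as `tokenize.detect_encoding()`, which performs the BOM check plus the comment scan, so you can watch it work (the sample source bytes here are made up):

```python
import io
import tokenize

# Latin-1 encoded source with a PEP 263 declaration; 0xe9 is 'é' in Latin-1.
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
encoding, consumed_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # 'iso-8859-1' -- tokenize normalizes the 'latin-1' alias
```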

I hope this answers your question.

Charles Merriam
  • The handling in Python reflects the long and storied history of the tokenizer. PEP 263 is pretty simple and clear, and any *incompatibilities between CPython, Jython, PyPy, and other implementations* would be PEP violations. There are **no dragons here**. – Martijn Pieters Nov 04 '20 at 08:20
  • Note that nothing in this answer addresses the actual question: how an **editor** might interpret those lines, if at all. – Martijn Pieters Nov 04 '20 at 08:21
  • There are dragons in the implementation. Writing custom encoders has shown me that. – Charles Merriam Nov 05 '20 at 19:08
  • True. For an editor, it's editor-specific. Almost all of them just set the encoding, with a warning if the BOM marks don't match. Some just crash if the Unicode won't parse. – Charles Merriam Nov 05 '20 at 19:09
1

Each editor has its own built-in algorithms, which depend on the byte content and sometimes the file extension, to determine the encoding. For most file extensions, if the editor cannot determine the encoding, it falls back to a common encoding, usually UTF-8 for text files and the like, since it supports a large set of characters and is widely used.
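For illustration, a BOM-based first pass might look like this minimal sketch (the BOM table and the UTF-8 fallback are assumptions, not any particular editor's logic):

```python
# Common byte order marks and the encodings they imply.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(raw: bytes) -> str:
    """Guess an encoding from a leading BOM; fall back to UTF-8."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding
    return "utf-8"  # the common fallback described above
```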

Take, for example, Python itself. During the era of Python 2, the default/fallback encoding for source code was ASCII, so the first few lines where you mention your encoding had to be valid ASCII for Python 2 to process them. In Python 3, this was switched to UTF-8. So the Python interpreter reads the first few lines as valid UTF-8 and then overrides that with whatever custom encoding you provide.
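You can watch the declaration being honoured end to end with a small experiment (the file name here is arbitrary):

```python
# Write a Latin-1 encoded module that declares its encoding per PEP 263.
source = "# -*- coding: latin-1 -*-\nname = 'café'\n"
with open("demo_latin1.py", "w", encoding="latin-1") as f:
    f.write(source)

# compile() on raw bytes applies the same encoding detection that the
# interpreter uses when running the file directly.
with open("demo_latin1.py", "rb") as f:
    code = compile(f.read(), "demo_latin1.py", "exec")

namespace = {}
exec(code, namespace)
print(namespace["name"])  # café
```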

Amit Singh
  • "So your first few lines where you mention your encoding should be valid ASCII for Python2 to process it. In Python 3, this has been switched to UTF-8. " <- this is the crucial part I think. So for example if the file is encoded with an encoding FOO which happens to not be a superset of ASCII, Python has no chance of interpreting the `# -*- coding: FOO -*-` line? – timgeb Nov 03 '20 at 21:53
  • @timgeb Encodings that are not a superset of ASCII are not supported by Python. – Martijn Pieters Nov 04 '20 at 07:17
  • @MartijnPieters Thanks, got a link? – timgeb Nov 04 '20 at 07:41
  • @timgeb: [PEP 263](https://www.python.org/dev/peps/pep-0263/) is the official reference here: *Any encoding which allows processing the first two lines in the way indicated above is allowed as source code encoding, this includes ASCII compatible encodings as well as certain multi-byte encodings such as Shift_JIS. It does not include encodings which use two or more bytes for all characters like e.g. UTF-16.* Note that Shift JIS is _basically_ a superset of ASCII here (only 0x5C and 0x7E differ, but valid encoding names never use `\` or `~`). – Martijn Pieters Nov 04 '20 at 08:15
  • @timgeb: hrm, this was all in response to your bounty? The question here is rather vague, it doesn't really narrow down if this was about an editor interpreting the Python comment or some other form of file encoding detection. If this is about the editor also interpreting the PEP 263 comment then that's up to each editor; [this older answer of mine](https://stackoverflow.com/a/20302074/100297) references the Emacs and VI documentation for these, but Gedit and Kate have similar support, and other editors have plugins that add modeline support. – Martijn Pieters Nov 04 '20 at 08:39
  • @MartijnPieters I should have been clearer in the bounty message. The perfect answer would have covered editors and the interpreter. My bad. I knew the PEP but the "in the way indicated above" was a little vague for me. – timgeb Nov 04 '20 at 08:44
0

I don't believe there is any foolproof way of knowing a file's encoding other than guessing an encoding and then trying to decode with it.

The editor might assume, for example, a UTF-8 encoding, a very common encoding capable of encoding any Unicode character. If the file decodes without errors, there is nothing else to do. Otherwise, the editor presumably has a strategy of trying certain other encodings until one succeeds without a decoding error, or it finally gives up. In the case of an editor that understands content, even if the file decodes without an error, the editor might additionally check whether the content is representative of what the file type implies.
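That guess-and-check loop might look like this minimal sketch (the candidate list and its ordering are assumptions):

```python
# Try candidate encodings in order until one decodes cleanly.
CANDIDATES = ["utf-8", "utf-16", "latin-1"]

def guess_encoding(raw: bytes) -> str | None:
    for encoding in CANDIDATES:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None  # give up (unreachable here: latin-1 accepts any byte)
```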

Booboo
-1

I'm not sure if I understood your question. However, all IDEs have a default encoding, which for Python IDEs is typically UTF-8. The decoder first checks whether each byte value is smaller or larger than 128: bytes below 128 stand for single ASCII characters, while higher values signal multi-byte sequences (as used by UTF-8, UTF-16, and so on).
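To illustrate the byte-value distinction (this shows UTF-8's layout; whether a given IDE uses exactly this check is an assumption):

```python
# In UTF-8, byte values below 128 are single-byte ASCII characters;
# values of 128 and above appear only inside multi-byte sequences.
for ch in "Aé€":
    print(ch, ch.encode("utf-8"))
# A b'A'              (1 byte, value < 128)
# é b'\xc3\xa9'       (2 bytes, values >= 128)
# € b'\xe2\x82\xac'   (3 bytes, values >= 128)
```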

Another reason the default encoding is UTF-8 is that UTF-8 can handle any Unicode code point.

You can find more info here: https://docs.python.org/3/howto/unicode.html