First: `M-o`, `M-;`, and `M-?` are a way of showing non-ASCII bytes using only ASCII. Specifically, the `M-` prefix shows that bit 7 (0x80) is set, and the remaining bits are then displayed as if the character were ASCII. Lowercase `o` is code `0x6f`, `;` is `0x3b`, and `?` is `0x3f`. Putting the high bit (0x80) back into all three, dropping the `0x`, and using uppercase, we get the values `EF`, `BB`, and `BF`. If nothing else, you should memorize this sequence (EF BB BF), or at least remember that it exists, because it's the UTF-8 encoding of a Unicode Byte Order Mark or BOM, `U+FEFF` (which you should also memorize, at least that it exists).
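The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `meta_to_byte` is made up for illustration, and it handles just the simple three-character `M-x` form, not the `M-^x` form that some tools print for control characters.

```python
def meta_to_byte(token: str) -> int:
    """Turn a simple 'M-x' token back into the byte it stands for."""
    if len(token) != 3 or not token.startswith("M-"):
        raise ValueError(f"not a simple M-x token: {token!r}")
    # The visible character carries the low 7 bits; 'M-' means bit 7 was set.
    return ord(token[2]) | 0x80


tokens = ["M-o", "M-;", "M-?"]
raw = bytes(meta_to_byte(t) for t in tokens)

hex_text = " ".join(f"{b:02X}" for b in raw)
decoded = raw.decode("utf-8")  # plain "utf-8" keeps the BOM; "utf-8-sig" would drop it

print(hex_text)                  # EF BB BF
print(f"U+{ord(decoded):04X}")   # U+FEFF
```

Decoding with the plain `utf-8` codec is deliberate here: Python's `utf-8-sig` codec would silently discard the mark, and there would be nothing left to print.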
For more on Unicode in general, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
When storing Unicode as UTF-16, the byte order mark has a purpose: it tells you whether the stored data is UTF-16-LE, or UTF-16-BE. But when storing Unicode as UTF-8, the byte order mark is almost entirely useless. I personally believe it should never be used. Microsoft, on the other hand, apparently believe it should always be used (or almost always). See the Wikipedia quote below.
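To see why the mark carries information in UTF-16 but not in UTF-8, it may help to encode U+FEFF itself in each form. A small sketch using only Python's standard codecs (the variable names are just for illustration):

```python
import codecs

bom = "\ufeff"

utf16_le = bom.encode("utf-16-le")  # low byte first
utf16_be = bom.encode("utf-16-be")  # high byte first
utf8 = bom.encode("utf-8")          # UTF-8 has only one possible byte order

print(utf16_le.hex(), utf16_be.hex(), utf8.hex())  # fffe feff efbbbf

# The two UTF-16 forms differ, so the first two bytes of a file reveal its byte order.
assert utf16_le == codecs.BOM_UTF16_LE
assert utf16_be == codecs.BOM_UTF16_BE
# The UTF-8 form is always the same three bytes, so it reveals nothing about order.
assert utf8 == codecs.BOM_UTF8
```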
> ... and someone uses the online editor ...

This online editor, apparently, is written either by Microsoft or by someone who thinks Microsoft is correct. It is inserting a UTF-8 byte order mark in your plain-text file.
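If you want to confirm that a particular file picked up the mark, checking its first three bytes is enough. The following is a rough sketch, not a recommendation of any particular workflow: the helper names and the sample content are hypothetical, and it works on bytes already read from the file.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file content, standing in for a file saved by the online editor.
sample = codecs.BOM_UTF8 + b"<Project>\n</Project>\n"

print(has_utf8_bom(sample))                  # True
print(has_utf8_bom(strip_utf8_bom(sample)))  # False
```

For a real file, you could read it with `open(path, "rb").read()` and pass the result to these helpers.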
> Bitbucket Support gave me articles about `.gitattributes` ...

Unless the online editor looks inside `.gitattributes` files, this won't help: it's that editor that is adding the BOM.
That said, since Git 2.18, Git has had the notion of a `working-tree-encoding` attribute. Some editors might actually look at this. I may not understand the Microsoft philosophy correctly (I already noted that I disagree with it). I think, though, that they say: store a BOM in any UTF-8 encoded file if the "main" copy of that file should be stored in UTF-16 format. (Side note: the UTF-8 BOM tells you nothing about whether the UTF-16 file would be UTF-16-LE or UTF-16-BE, so, again in my opinion, it's pretty useless as an indicator. See also "In UTF-16, UTF-16BE, UTF-16LE, is the endian of UTF-16 the computer's endianness?")
In any case, if this editor does look at some configuration option, setting the configuration option, whatever it is, would help. If it does not, nothing you do here will help. Note that `working-tree-encoding`, while related to Unicode encoding, does not imply that a BOM should or should not be included. So, if your Git is 2.18 or later, you have this extra knob you can twiddle, but that's not what it's for. If it does actually help, that's great, but also quite wrong. :-)
The thing that's weirdest about this is:
> [The BOM] breaks my `*.csproj` files and fails to load projects in Visual Studio.

Visual Studio is a Microsoft product. The Wikipedia page notes that:

> Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.

One would think that if their editors insist on adding BOMs, their other programs would be able to handle BOMs.