First, what is the sequence M-oM-;M-? that shows up below?

When I push a commit to Bitbucket and someone then uses the online editor to make a small change, the editor changes the first line from:

<?xml version="1.0" encoding="utf-8"?>

to:

M-oM-;M-?<?xml version="1.0" encoding="utf-8"?>

I can see these special characters using cat -A <myfile>

This is a problem because this breaks my *.csproj files and fails to load projects in Visual Studio.

Bitbucket Support gave me articles about .gitattributes and git config, both of which I've already tried, but the issue persists:

$ git config core.autocrlf
true

$ cat .gitattributes
*.js text
*.cs text
*.xml text
*.csproj text
*.sln text
*.config text
*.cshtml text
*.json text
*.sql text
*.ts text
*.xaml text

I've also tried:

$ cat .gitattributes
*.js text eol=crlf
*.cs text eol=crlf
*.xml text eol=crlf
*.csproj text eol=crlf
*.sln text eol=crlf
*.config text eol=crlf
*.cshtml text eol=crlf
*.json text eol=crlf
*.sql text eol=crlf
*.ts text eol=crlf
*.xaml text eol=crlf

Is there some setting that I'm missing to help prevent this set of characters from being inserted into the start of my files?

JacobIRR
  • Ever figure out a solution? We are facing a similar problem. – Terry Oct 06 '19 at 16:35
  • @Terry I literally told my team “do not use the bitbucket editor!” – JacobIRR Oct 06 '19 at 16:38
  • Haha, that's what I want to say, but for a migration period we are facing, it would be easier for one of the network admins to quickly go in and change some configuration settings. :( – Terry Oct 06 '19 at 16:42
  • Actually it is weird. It has to be a per-user setting. I just went in and edited an XML file online and it didn't cause the corruption. Investigating more. Will post an update if I have any. I'm wondering if it is a different user or the environment the browser is running in that is causing the problem. – Terry Oct 06 '19 at 16:48

1 Answer

First: M-o, M-;, and M-? are how cat -A (like cat -v) shows non-ASCII bytes using only ASCII characters. The M- prefix means that bit 7 (0x80) is set; the remaining seven bits are then displayed as if they were an ASCII character. Lowercase o is code 0x6f, ; is 0x3b, and ? is 0x3f. Putting the high bit (0x80) back into all three, dropping the 0x, and using uppercase, we get the values EF, BB, and BF.

If nothing else, you should memorize this sequence, EF BB BF, or at least remember that it exists, because it's the UTF-8 encoding of the Unicode byte order mark or BOM, U+FEFF (which you should also memorize, or at least remember that it exists).
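As a rough sketch of that arithmetic, the Python below turns the M- notation back into raw bytes. The helper name decode_meta_notation is made up for this illustration. It handles only plain ASCII characters and the "M-" plus one printable character form, not the ^X control-character forms that cat -A can also emit, and it would misread a literal "M-" in the file's own text.

```python
def decode_meta_notation(text: str) -> bytes:
    """Convert a subset of cat -v / cat -A output back to raw bytes.

    Hypothetical helper for illustration: "M-c" becomes the byte for c with
    bit 7 (0x80) set; every other character is taken as a plain ASCII byte.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("M-", i) and i + 2 < len(text):
            # Put the high bit back into the character that follows "M-".
            out.append(ord(text[i + 2]) | 0x80)
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


bom = decode_meta_notation("M-oM-;M-?")
print(bom.hex(" ").upper())              # EF BB BF
print(bom.decode("utf-8") == "\ufeff")   # True: the three bytes are U+FEFF
```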

For more on Unicode in general, see Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

When storing Unicode as UTF-16, the byte order mark has a purpose: it tells you whether the stored data is UTF-16-LE or UTF-16-BE. But when storing Unicode as UTF-8, the byte order mark is almost entirely useless, because UTF-8 has only one byte order. I personally believe it should never be used. Microsoft, on the other hand, apparently believes it should always be used (or almost always). See the Wikipedia quote below.
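To make the difference concrete, here is a small sketch using Python's standard codecs module (my choice of tool, not something from the question). It shows that the UTF-16 BOM differs by byte order while the UTF-8 BOM is always the same three bytes, and that Python's "utf-8-sig" codec drops a leading BOM while plain "utf-8" keeps it as a U+FEFF character. The variable names are illustrative only.

```python
import codecs

# In UTF-16, U+FEFF is stored in one of two byte orders, so a reader can
# tell little-endian from big-endian by looking at the first two bytes.
print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe
print(codecs.BOM_UTF16_BE.hex(" "))  # fe ff

# In UTF-8 the same code point always encodes to the same three bytes,
# so it carries no byte-order information.
print(codecs.BOM_UTF8.hex(" "))      # ef bb bf
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)  # True

# A file as the online editor apparently leaves it: BOM, then the XML.
with_bom = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?>'

# Plain "utf-8" keeps the BOM as a leading U+FEFF character ...
print(with_bom.decode("utf-8").startswith("\ufeff"))     # True
# ... while "utf-8-sig" removes it if present.
print(with_bom.decode("utf-8-sig").startswith("<?xml"))  # True
```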

... and someone uses the online editor ...

This online editor, apparently, is either written by Microsoft, or by someone who thinks Microsoft is correct. They are inserting a UTF-8 byte order mark in your plain-text file.

Bitbucket Support gave me articles about .gitattributes ...

Unless the online editor looks inside .gitattributes files, this won't help: it's that editor that is adding the BOM.

That said, since Git 2.18, Git has had the notion of a working-tree-encoding attribute. Some editors might actually look at this. I may not understand the Microsoft philosophy correctly (I already noted that I disagree with it). I think, though, that they say: store a BOM in any UTF-8 encoded file if the "main" copy of that file should be stored in UTF-16 format. (Side note: the UTF-8 BOM tells you nothing about whether the UTF-16 file would be UTF-16-LE or UTF-16-BE, so, again in my opinion, it's pretty useless as an indicator. See also the question "In UTF-16, UTF-16BE, UTF-16LE, is the endian of UTF-16 the computer's endianness?")

In any case, if this editor does look at some configuration option, setting the configuration option—whatever it is—would help. If it does not, nothing you do here will help. Note that working-tree-encoding, while related to Unicode encoding, does not imply that a BOM should or should not be included. So, if your Git is 2.18 or later, you have this extra knob you can twiddle, but that's not what it's for. If it does actually help, that's great, but also quite wrong. :-)
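If the editor ignores every setting, one fallback is to detect and remove the BOM after the fact, for example in a script run after pulling. This is a workaround sketch only, not a way to stop the editor from adding the BOM. The helper names has_utf8_bom and strip_utf8_bom and the file name Demo.csproj are invented for this example.

```python
import codecs
import tempfile
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file in place.

    Returns True if the file was rewritten, False if it had no BOM.
    Only the first three bytes are touched; line endings are left alone.
    """
    data = path.read_bytes()
    if not has_utf8_bom(data):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "Demo.csproj"
    project.write_bytes(
        codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?>\r\n'
    )
    print(strip_utf8_bom(project))    # True: BOM found and removed
    print(project.read_bytes()[:5])   # b'<?xml'
    print(strip_utf8_bom(project))    # False: nothing left to strip
```

Working on raw bytes rather than decoded text keeps the CRLF line endings that core.autocrlf and the eol=crlf attributes are managing.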

The thing that's weirdest about this is:

[The BOM] breaks my *.csproj files and fails to load projects in Visual Studio.

Visual Studio is a Microsoft product. The Wikipedia page on the byte order mark notes that:

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.

One would think that if their editors insist on adding BOMs, their other programs would be able to handle BOMs.

torek