
There are many, many places describing how to "force" Git to read a file as text. Generally, the solution involves adding a filter to .gitattributes to apply the text attribute to the file(s). Examples include:

* text
* text=auto
* text diff merge
* text=auto diff merge

But this solution does not seem to work if the file contains NUL bytes. Here is an example text file with ANSI encoding and trailing NUL bytes:

[screenshot of the example file]

It's completely readable as a text file, just not by Git. Every example filter above fails, and Git identifies the file as "binary" regardless. I think this is due to its hard-coded check for a NUL in the first 8000 bytes (ref).
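
For reference, that heuristic can be approximated from the shell. This is only a rough imitation of the check described above, not Git's own code: `has_nul_in_head` is a made-up helper name, the scratch file names are illustrative, and `head -c` is assumed to be available (GNU and BSD `head` both provide it).

```shell
# Hypothetical helper: succeeds if FILE has a NUL byte in its first 8000 bytes.
# It compares the byte count of the leading chunk with and without NULs.
has_nul_in_head() {
    total=$(head -c 8000 < "$1" | wc -c)
    without_nul=$(head -c 8000 < "$1" | tr -d '\0' | wc -c)
    [ "$total" -ne "$without_nul" ]
}

# Demo on two scratch files (names are illustrative only).
scratch=$(mktemp -d)
printf 'plain text\n' > "$scratch/plain.txt"
printf 'text with trailing NULs\0\0\0' > "$scratch/padded.txt"
for f in "$scratch/plain.txt" "$scratch/padded.txt"; do
    if has_nul_in_head "$f"; then
        echo "$f: NUL found in first 8000 bytes"
    else
        echo "$f: no NUL in first 8000 bytes"
    fi
done
rm -rf "$scratch"
```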

Of course, as soon as I convert the file to UTF-8 Git happily identifies it as text. Here is that same file after conversion:

[screenshot of the same file after conversion]

Frankly I don't mind not using ANSI encoding. I'm just trying to avoid constantly opening files in Notepad++ just to fix the file encoding. Is there a way to make Git handle the encoding conversion automatically?

patricktokeeffe
  • There is no major single-byte encoding (whether so-called "ANSI" or not) where NUL is anything other than a NUL. The same byte is also a NUL in UTF-8. Your tool is actually stripping these characters incorrectly when converting them to UTF-8. And this is by definition not a text file, since NUL is never valid in a text file, according to POSIX. – bk2204 Sep 18 '20 at 00:52

1 Answer


You have a couple of problems here. The first is that these are definitely not text files, since they contain a NUL byte. No major single-byte encoding permits NUL bytes to represent anything other than a NUL because C terminates its strings with that byte, and using it for another purpose would mean that text in that encoding would not fit into a normal C string. POSIX specifically excludes files containing NUL bytes from being text files for this reason.

The tool you're using to convert your “ANSI” files to UTF-8 is actually stripping out the NUL bytes instead of converting them, which is why the files then work. The NUL byte in UTF-8 means exactly the same thing as it does in your single-byte encoding: a NUL.
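
One way to see this for yourself is to push a NUL byte through a plain conversion and count what comes out. This is a small sketch that assumes an `iconv` that knows the name WINDOWS-1252 (glibc's does); the three-byte sample input is invented for illustration.

```shell
# Input is three bytes: 'a', NUL, 'b'.
# Count the bytes that survive a straight Windows-1252 to UTF-8 conversion.
converted_len=$(printf 'a\0b' | iconv -f WINDOWS-1252 -t UTF-8 | wc -c)
echo "bytes after conversion: $converted_len"

# Dump the converted bytes so that the NUL, if still present, is visible.
printf 'a\0b' | iconv -f WINDOWS-1252 -t UTF-8 | od -An -c
```

If the byte count is unchanged, the conversion itself did not drop the NUL, so a tool whose output has no NULs must be removing them as a separate step.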

It also isn't clear what you're asking Git to do in this case. The text attribute asks Git to perform end-of-line normalization. However, if your file contains NUL bytes, then Git is still going to think it's a binary file for the purposes of diffs and merges, because the text attribute doesn't control that. You need the diff and merge attributes as well.

Of course, if you don't really want or need the NUL bytes and these are supposed to be human-readable, then you really are better off just stripping out the NUL bytes and converting to UTF-8. In 2020, there's no longer any good reason to use a single-byte encoding. If that's what you want to do, then you can strip the NUL bytes and convert to UTF-8 by doing the following (assuming you're using Git Bash, WSL, or a Linux system):

$ tr -d '\0' < FILENAME | iconv -f WINDOWS-1252 -t UTF-8 > FILENAME.tmp && \
  mv FILENAME.tmp FILENAME
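
If you have more than one file to fix, the same pipeline could be wrapped in a small function. This is just a sketch: `strip_nul_to_utf8` is a name I made up, it still assumes the input is Windows-1252, and it relies on `mktemp` being available. Note that `tr` only reads standard input, so the file is supplied by redirection.

```shell
# Hypothetical helper (not part of Git or iconv): strip NUL bytes from a
# file and re-encode it from Windows-1252 to UTF-8, replacing it in place.
strip_nul_to_utf8() {
    src=$1
    # Temporary file next to the original so that mv stays on one filesystem.
    tmp=$(mktemp "${src}.XXXXXX") || return 1
    if tr -d '\0' < "$src" | iconv -f WINDOWS-1252 -t UTF-8 > "$tmp"; then
        # Caveat: the replacement file gets mktemp's permissions, not the original's.
        mv -- "$tmp" "$src"
    else
        # Conversion failed: leave the original file untouched.
        rm -f -- "$tmp"
        return 1
    fi
}
```

It would be called as `strip_nul_to_utf8 FILENAME`, once per file. Keep a backup or make sure the files are committed first, since the original is overwritten.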

That also assumes that the “ANSI” encoding you're using is actually Windows-1252. IANA (which maintains the registry of character sets) doesn't know of any encoding called “ANSI”, but Windows-1252 is the most common character set referred to that way.

Finally, you can specify a working tree encoding with the working-tree-encoding value in gitattributes if you absolutely must handle non-UTF-8 files. That isn't going to fix your NUL problem, though, and UTF-8 is a better choice in almost all situations.
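
As a rough illustration of what that looks like, the snippet below writes such a line into a .gitattributes file in a scratch directory. The `*.dat` pattern is a made-up example, and I'm assuming your iconv accepts the name WINDOWS-1252, since Git relies on iconv for the conversion.

```shell
# Work in a scratch directory so that no real repository is touched.
demo_dir=$(mktemp -d)
attr_file="$demo_dir/.gitattributes"

# '*.dat' is a hypothetical pattern; adjust it to match your own files.
# The encoding name must be one that your platform's iconv recognizes.
printf '%s\n' '*.dat text working-tree-encoding=WINDOWS-1252' > "$attr_file"

# Show the resulting attributes file.
cat "$attr_file"
```

With a line like this, Git should keep the matching files as UTF-8 in the repository and re-encode them on checkout. The NUL bytes would still be there either way.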

bk2204
  • No. No. NULL is permitted in most encodings, and it is not the C terminated string. C chose such a convention, and it was also called ASCIIZ. Other protocols just use a newline as the terminator (or a specific escape). Pascal, Python (and many other languages) allow you to have NULL in a string, and it is OK for ASCII strings. There are UTF-8 encodings (not official from Unicode, but compatible) which allow NULL in a string, and `\0` to terminate a string. – Giacomo Catenazzi Sep 18 '20 at 07:21
  • I'm not arguing about any language other than C. I'm fully aware that NUL is allowed in many languages' strings. It is, however, [prohibited in text files by POSIX](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html), and I stand by my statement that no major single-byte encoding uses the zero bytes as anything other than NUL. – bk2204 Sep 18 '20 at 12:53
  • ASCII allows NULL. The question is not about POSIX (nobody will use "ANSI" for an encoding in POSIX), and the language in question is not C (unless one uses `#define If if` and so on). Linux uses null, e.g. in `/proc`, as a field separator (the rest is text), and `git` is not just about text files (in the POSIX definition; in fact, it allows different end-of-line characters). – Giacomo Catenazzi Sep 18 '20 at 13:15
  • I fully understand how Git works in this case and that it can handle any kind of files; I'm a core contributor. The generally understood definition of "text file" (contrasted with "binary file") excludes bytes with NUL; using a standard definition, such as POSIX, to define the term "text file" is entirely reasonable. For example, `file` would call these files “data” because they contain NUL, not “text”; see the manual page for details. – bk2204 Sep 18 '20 at 21:05
  • @bk2204 `file` apparently considers files with an odd number of `NUL` characters to be data (`printf 'foo\0' | file -` -> `data`) but files with an even number to be `ascii` (`printf 'foo\0\0' | file -`). – CervEd Dec 18 '21 at 16:12