How does the Linux command file recognize the encoding of my files?
zell@ubuntu:~$ file examples.desktop
examples.desktop: UTF-8 Unicode text
zell@ubuntu:~$ file /etc/services
/etc/services: ASCII text
The man page is pretty clear:
The filesystem tests are based on examining the return from a stat(2) system call...
The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable (compiled program) a.out file, whose format is defined in #include &lt;a.out.h&gt; and possibly #include &lt;exec.h&gt; in the standard include directory. These files have a 'magic number' stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a 'magic' has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. The information identifying these files is read from the compiled magic file /usr/share/misc/magic.mgc, or the files in the directory /usr/share/misc/magic if the compiled file does not exist. In addition, if $HOME/.magic.mgc or $HOME/.magic exists, it will be used in preference to the system magic files. If /etc/magic exists, it will be used together with other magic files.
If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported.
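To see the magic mechanism itself in action, you can write a one-line magic file and point file at it with -m. This is only a sketch; the names mymagic and sample.dat are made up:

# A magic entry: at offset 0, the string "HELLO" identifies the file
cat > mymagic <<'EOF'
0   string   HELLO   my custom greeting file
EOF
printf 'HELLO world\n' > sample.dat
file -m mymagic sample.dat
# sample.dat: my custom greeting file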
In short, for regular files, their magic values are tested. If there's no match, then file checks whether it's a text file, making an educated guess about the specific encoding by looking at the actual values of bytes in the file.
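You can watch that heuristic at work by feeding file a few hand-made samples (a sketch; the file names are made up, and the exact output wording varies between versions of file):

printf 'hello world\n' > ascii.txt    # only 7-bit bytes
printf 'caf\xc3\xa9\n' > utf8.txt     # 0xC3 0xA9 is a valid UTF-8 sequence (e-acute)
printf 'caf\xe9\n' > latin1.txt       # a bare 0xE9 is invalid UTF-8 but printable ISO-8859-1
file ascii.txt utf8.txt latin1.txt
# ascii.txt:  ASCII text
# utf8.txt:   UTF-8 Unicode text
# latin1.txt: ISO-8859 text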
Oh, and you can also download the source code and look at the implementation for yourself.
The source code is on GitHub, so anyone can search it. After doing a quick search, strings like BOM, ef bb bf, and feff do not appear at all. That means Byte-Order-Mark reading is not supported for UTF-8 (and that's the main charset you need to care about). Files made in other applications that use or preserve the BOM marker will all be reported as "charset=unknown" when using file.
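If you want to reproduce both claims yourself, here is a sketch (https://github.com/file/file is the official mirror; what file reports for bom.txt depends on the version installed):

git clone https://github.com/file/file.git
grep -rni 'bom' file/src/ | head             # search the detector sources yourself

printf '\xef\xbb\xbfsome text\n' > bom.txt   # UTF-8 BOM followed by ASCII
file bom.txt
file -i bom.txt                              # MIME output includes the detected charset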
In addition, none of the config files mentioned in the Magic File manpage are a part of magic file v. 4.17. In fact, /etc/magicfile/ doesn't exist at all, so I don't see any way in which I can configure it.
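One way to check which magic files your build actually consults (a sketch; newer versions of file print the search path directly):

file --version                               # newer builds also print the magic file path(s)
ls -l /etc/magic /usr/share/misc/magic.mgc "$HOME/.magic" 2>/dev/null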
If you're stuck trying to get the ACTUAL charset encoding and magic file is all you have, you can determine whether you have a UTF-8 file at the Linux CLI with:
hexdump -n 3 -C "$path_to_filename"
If the above returns the sequence ef bb bf, then you are 99% likely in possession of a BOM-marked UTF-8 file. This is not a 100% certainty, but it is far more useful than magic file, which has no handling whatsoever for Byte Order Marks.
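If you need that check inside a script, here is a minimal sketch (assuming $path_to_filename is set as above):

# Compare the first three bytes against the UTF-8 BOM, ef bb bf
if [ "$(head -c 3 "$path_to_filename" | od -An -tx1 | tr -d ' ')" = "efbbbf" ]; then
    echo "looks like a BOM-marked UTF-8 file"
fi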