How does the Linux command file recognize the encoding of my files?
zell@ubuntu:~$ file examples.desktop
examples.desktop: UTF-8 Unicode text
zell@ubuntu:~$ file /etc/services
/etc/services: ASCII text
The man page is pretty clear:
The filesystem tests are based on examining the return from a stat(2) system call...
The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable (compiled program) a.out file, whose format is defined in #include &lt;a.out.h&gt; and possibly #include &lt;exec.h&gt; in the standard include directory. These files have a 'magic number' stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a 'magic' has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. The information identifying these files is read from the compiled magic file /usr/share/misc/magic.mgc, or the files in the directory /usr/share/misc/magic if the compiled file does not exist. In addition, if $HOME/.magic.mgc or $HOME/.magic exists, it will be used in preference to the system magic files. If /etc/magic exists, it will be used together with other magic files.
If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported.
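To see the magic mechanism itself in action, you can write a one-line magic file and point file at it with -m. This is only a sketch; the names mymagic and sample.dat are made up:

# A magic entry: at offset 0, the string "HELLO" identifies the file
cat > mymagic <<'EOF'
0   string   HELLO   my custom greeting file
EOF
printf 'HELLO world\n' > sample.dat
file -m mymagic sample.dat
# sample.dat: my custom greeting file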
In short, for regular files, their magic values are tested. If there's no match, then file checks whether it's a text file, making an educated guess about the specific encoding by looking at the actual values of bytes in the file.
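You can watch that heuristic at work by feeding file a few hand-made samples (a sketch; the file names are made up, and the exact output wording varies between versions of file):

printf 'hello world\n' > ascii.txt    # only 7-bit bytes
printf 'caf\xc3\xa9\n' > utf8.txt     # 0xC3 0xA9 is a valid UTF-8 sequence (e-acute)
printf 'caf\xe9\n' > latin1.txt       # a bare 0xE9 is invalid UTF-8 but printable ISO-8859-1
file ascii.txt utf8.txt latin1.txt
# ascii.txt:  ASCII text
# utf8.txt:   UTF-8 Unicode text
# latin1.txt: ISO-8859 text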
Oh, and you can also download the source code and look at the implementation for yourself.
The source code is on GitHub, so anyone can search it. After doing a quick search, strings like BOM, ef bb bf, and feff do not appear at all. That means Byte-Order-Mark reading is not supported for UTF-8 (and that's the main charset you need to care about). Files made in other applications that use or preserve the BOM marker will all be reported as "charset=unknown" when using file.
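If you want to reproduce both claims yourself, here is a sketch (https://github.com/file/file is the official mirror; what file reports for bom.txt depends on the version installed):

git clone https://github.com/file/file.git
grep -rni 'bom' file/src/ | head             # search the detector sources yourself

printf '\xef\xbb\xbfsome text\n' > bom.txt   # UTF-8 BOM followed by ASCII
file bom.txt
file -i bom.txt                              # MIME output includes the detected charset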
In addition, none of the config files mentioned in the Magic File manpage are a part of magic file v. 4.17. In fact, /etc/magicfile/ doesn't exist at all, so I don't see any way in which I can configure it.
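One way to check which magic files your build actually consults (a sketch; newer versions of file print the search path directly):

file --version                               # newer builds also print the magic file path(s)
ls -l /etc/magic /usr/share/misc/magic.mgc "$HOME/.magic" 2>/dev/null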
If you're stuck trying to get the ACTUAL charset encoding and magic file is all you have, you can determine whether you have a UTF-8 file at the Linux CLI with:
hexdump -n 3 -C "$path_to_filename"
If the above returns the sequence ef bb bf, then you are 99% likely in possession of a BOM-marked UTF-8 file. This is not a 100% certainty, but it is far more useful than magic file, which has no handling whatsoever for Byte Order Marks.
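If you need that check inside a script, here is a minimal sketch (assuming $path_to_filename is set as above):

# Compare the first three bytes against the UTF-8 BOM, ef bb bf
if [ "$(head -c 3 "$path_to_filename" | od -An -tx1 | tr -d ' ')" = "efbbbf" ]; then
    echo "looks like a BOM-marked UTF-8 file"
fi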