
When I have a file created in Vim/Linux with :set fileencoding=utf-8 that contains diacritics (e.g. German umlauts), then running file myfile.txt reports myfile.txt: UTF-8 Unicode text. If the file contains no diacritics, the encoding is reported as myfile.txt: ASCII text.
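The behaviour is easy to reproduce directly from the shell (a minimal sketch; the file names are arbitrary, and it assumes bash's printf, which understands \xHH escapes):

```shell
# A file containing only 7-bit characters
printf 'hello\n' > ascii.txt

# The same text with a German umlaut; \xc3\xa4 is the UTF-8 encoding of 'ä'
printf 'h\xc3\xa4llo\n' > utf8.txt

file ascii.txt utf8.txt
```

On a typical system the first file is reported as "ASCII text" and the second as "UTF-8 Unicode text" (the exact wording varies between versions of file).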

Why is that? And how can I reliably determine that a whole set of files is correctly encoded as UTF-8?

EDIT:

ASCII is 7-bit and a subset of UTF-8. I want to know whether my source files are encoded in UTF-8 so that they can hold diacritics at some point in the future. IMO this is not obvious, and I would like to find a way to determine this safely.
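One way to check a whole batch of files is to round-trip them through iconv, which exits with a non-zero status when the input contains byte sequences that are not valid UTF-8 (a sketch; the *.txt glob is illustrative):

```shell
# Report every *.txt file whose bytes are NOT valid UTF-8.
# Note: pure-ASCII files pass, since ASCII is a subset of UTF-8;
# this checks validity, it cannot prove the author *intended* UTF-8.
for f in *.txt; do
    if ! iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
        echo "not valid UTF-8: $f"
    fi
done
```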

  • Note that ASCII is valid UTF-8. Any ASCII file is a perfectly fine UTF-8 encoded file. – nos Jan 28 '16 at 10:30
  • @nos I was stumbling over the question of why `file` actually needs a real diacritic to show that it's utf-8. Can I furthermore be sure that ASCII is *only* utf-8 and not some ISO8859 encoding? – ferdy Jan 28 '16 at 10:35
  • If there are no non-ASCII characters, `file` has no evidence for concluding that the file's encoding is anything other than ASCII. – 一二三 Jan 28 '16 at 11:15
    @一二三 And indeed, it *is* valid ASCII *and* valid UTF-8 (and valid ISO-8859-1, and a number of other encodings) in this case -- they all overlap. It's correct in this case to report the lowest common denominator, though understandably confusing if you are unaware of the fact that these encodings are all compatible in this subrange. – tripleee Jan 28 '16 at 11:59
  • @tripleee Yes, you're right - they overlap in this subrange, but the file will no longer be valid UTF-8 when I later add an 'ä' to an ISO-8859-1 file; and at the time the file was created as an ISO-8859 file, I have no chance to determine whether it is utf-8, which would be the actually wanted encoding. I'll answer my question myself, as I found a solution to this. – ferdy Jan 28 '16 at 12:15
  • Appending a random byte to a file is always possible if you have write permission, and can wreck the encoding at any time if the byte you append is not well-defined in the encoding the file had previously. Your mental model of how files are created and saved seems to be flawed. There is no extrinsic "encoding property" of a file which somehow survives if the bytes on the disk are overwritten with other bytes. – tripleee Jan 28 '16 at 12:19
  • @tripleee Thank you. So the encoding of a file is determined at the time the content of the file is interpreted. If I have a UTF-8 double-byte character (e.g. 'ä' 0xc3a4) and I have a 7-bit ASCII locale, it will be interpreted as the two-character representation of 0xc3 and 0xa4 in ASCII. Is that correct? That means UTF-8 is only a common agreement that the byte pair 0xc3a4 will be interpreted as an 'ä'. – ferdy Jan 28 '16 at 12:51
  • Yeah, that sounds right, except that 0xC3 and 0xA4 are undefined in ASCII, by definition (it is a 7-bit encoding). – tripleee Jan 28 '16 at 12:57
  • "So the encoding of a file is determined at the time the content of a file is interpreted": NO!!! The encoding is determined by the process that created it. Agree on an encoding, document it/remember it/use it. Once that metadata is lost, you no longer know the text contents. Guessing is just that. My guess is always CP437. How could you prove that wrong? – Tom Blodget Jan 28 '16 at 17:49

1 Answer


There is no generic and reliable way to find out which encoding a text file uses. Furthermore, quite a few encodings are supersets of 7-bit ASCII (UTF-8, ISO 8859-*, ...).

In the case of UTF-8, one trick is to add an (otherwise unnecessary) BOM (Byte Order Mark) at the beginning of the file. In that case file displays something like:

some.txt: UTF-8 Unicode (with BOM) text

I think that for Vim the option is :set bomb
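The BOM can also be prepended from the shell (a sketch; some.txt is just an example name, and it assumes bash's printf). The UTF-8 BOM is the three bytes EF BB BF:

```shell
# Write the UTF-8 BOM, then ordinary ASCII content
printf '\xEF\xBB\xBF' > some.txt
printf 'plain ASCII text\n' >> some.txt

file some.txt    # now reported as UTF-8 with BOM (exact wording varies)
```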

Unfortunately, while most editors understand the BOM, bash does not. Don't add it to shell scripts!
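The reason is that the kernel looks for the two bytes #! at the very start of the file when deciding how to run a script; with a BOM in front, the shebang line is not recognized. A quick way to check whether a script starts with a BOM (a sketch; script.sh is a hypothetical name):

```shell
# Dump the first three bytes as hex and compare against the UTF-8 BOM (EF BB BF)
if [ "$(head -c 3 script.sh | od -An -tx1 | tr -d ' ')" = "efbbbf" ]; then
    echo "script.sh starts with a BOM -- the shebang will not work"
fi
```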

bwt