
A long time ago, before world scripts were born, text files were all ASCII.
Nowadays, we have world scripts.
If I open a text file in a hex editor, is there a way to tell whether its encoding is ASCII or UTF-8?

pat
  • Um, what is a "world script"? And no, text files were *never* "all ASCII". – Nicol Bolas Jul 16 '20 at 16:13
  • Sorry, world script is an old Apple technology that lets the user enter languages other than English into a file and save it as a Unicode file. – pat Jul 16 '20 at 16:49
  • And no, the files were not only ASCII. We had various other *standards* and conventions. Luckily you never read about EBCDIC, so you have no nightmares. And in more recent times, files used extended ASCII (every extension incompatible with the others, except for the standard ASCII part). And if you want to sleep well, do not look up what the first 32 ASCII characters originally stood for, and how they are used or not used. – Giacomo Catenazzi Jul 17 '20 at 07:21
  • Does this answer your question? [How can I find encoding of a file via a script on Linux?](https://stackoverflow.com/questions/805418/how-can-i-find-encoding-of-a-file-via-a-script-on-linux) – tripleee Mar 24 '23 at 17:38

1 Answer


UTF-8 is backwards compatible with ASCII: an ASCII text file is also a UTF-8 text file.

If a file contains any byte whose first hex digit is 8 through F (that is, a value from 80 to FF, with the high bit set), it's not ASCII.
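The same test can be done programmatically rather than by eye. Here is a minimal Python sketch; the function name `is_ascii` is made up for illustration, and it simply checks that no byte has its high bit set:

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte is below 0x80 (first hex digit 0 through 7)."""
    return all(b < 0x80 for b in data)


# Illustrative inputs: plain ASCII text, and text with one non-ASCII letter.
print(is_ascii(b"plain text"))
print(is_ascii("café".encode("utf-8")))
```

To try it on a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the result in.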

If a file is not ASCII, it may be UTF-8 if every byte that starts with C, D, E, or F is followed by the right number of bytes that start with 8, 9, A, or B: one after a byte starting with C or D, two after E, and three after F. If a byte starting with 8 through F appears in any other context, it's not UTF-8.
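As a rough illustration of that hex-digit rule, here is a Python sketch that performs only the structural check. The name `looks_like_utf8` is hypothetical, and the sketch assumes the standard UTF-8 layout in which a lead byte starting with C or D takes one continuation byte, E takes two, and F takes three. It is deliberately looser than a real validator: it accepts a few lead bytes (C0, C1, F5 to F7) that strict UTF-8 rejects.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check only: each lead byte must be followed by the
    expected number of continuation bytes (first hex digit 8, 9, A, or B)."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                # first digit 0-7: ASCII, stands alone
            follow = 0
        elif 0xC0 <= b <= 0xDF:     # first digit C or D: one continuation byte
            follow = 1
        elif 0xE0 <= b <= 0xEF:     # first digit E: two continuation bytes
            follow = 2
        elif 0xF0 <= b <= 0xF7:     # first digit F: three continuation bytes
            follow = 3
        else:
            # A stray continuation byte (8-B), or F8-FF, which never
            # occurs in UTF-8.
            return False
        if i + follow >= n and follow > 0:
            return False            # sequence is cut off at end of data
        for k in range(1, follow + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False        # expected a continuation byte here
        i += follow + 1
    return True
```

For example, the Latin-1 bytes `b"\xe9t\xe9"` ("été") fail this check, because E9 is followed by an ASCII byte instead of two continuation bytes.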

There are a few more requirements for valid UTF-8, such as rejecting overlong encodings, surrogate code points, and code points above U+10FFFF, but they are harder to glean with a hex editor. See https://en.m.wikipedia.org/wiki/UTF-8
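If a script is an option, the stricter requirements can be left to an existing decoder instead of checking them by hand. The sketch below leans on Python's built-in strict `utf-8` codec, which raises `UnicodeDecodeError` on invalid input; the function name `classify` and its return labels are made up for illustration.

```python
def classify(data: bytes) -> str:
    """Label bytes as 'ASCII', 'UTF-8', or 'neither' (hypothetical labels)."""
    if all(b < 0x80 for b in data):
        return "ASCII"          # pure ASCII is also valid UTF-8
    try:
        data.decode("utf-8")    # strict decoding: raises on invalid sequences
    except UnicodeDecodeError:
        return "neither"
    return "UTF-8"


print(classify(b"abc"))
print(classify("naïve".encode("utf-8")))
print(classify(b"\xe9t\xe9"))   # Latin-1 bytes for "été"
```

A "UTF-8" result only means the bytes are valid UTF-8. Text in some other encoding could in principle pass by coincidence, although that is unlikely for anything but very short files.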

Joni