file on UTF-8 and ISO8859-1

Question

Currently I have a program that tries to mimic the functionality of the (Linux) file command. I parse a .txt file and report how its contents should be interpreted. However, I struggle to identify files that should be ISO8859-1 (Latin-1), because the ISO8859-1 characters end up as UTF-8 encodings instead (for instance æ, which is e6 in ISO8859-1, is encoded as c3 a6).

When I create this .txt file and pass it to file:

printf "æøå" > test.txt

file test.txt

it returns simply:

UTF-8 Unicode text, with no line terminators.

od -c -tx1 test.txt returns:

0000000 303 246 303 270 303 245
         c3  a6  c3  b8  c3  a5
0000006

Can anyone explain why this is the case? The characters 'æøå' are all contained in ISO8859-1, yet the file is interpreted as UTF-8 instead.

Please [edit] your question and show the output of `od -c -tx1 test.txt` to make sure the file really contains the expected hex values. BTW: Although you might want to implement something in C, your question is not related to C as it only mentions some shell commands. – Bodo, Sep 17 '19 at 08:03
@Bodo I corrected the question and added the output of the command. It makes sense that it interprets the values as 2-byte sequences and treats them as UTF-8. This is still weird to me, though, as the ISO8859-1 standard includes æøå in its range of 160-255. – NewDev90, Sep 17 '19 at 08:07
It's unclear what you find weird in this. If your terminal encoding is UTF-8, wouldn't you expect the file to be created in UTF-8? At what point do you expect ISO8859-1 to be involved? – user694733, Sep 17 '19 at 08:19

score 4 · Bodo · Accepted Answer · 2019-09-17T12:55:41.353

Your file clearly contains UTF-8-encoded text. For example, c3 a6 is the UTF-8 encoding of æ.

Your system locale is probably set to a UTF-8 one, so the terminal passes æøå to printf as UTF-8 bytes. You can check this by running the locale command.
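If it helps, here is a small sketch of how you could check the locale and build test files whose bytes do not depend on it. The file names utf8.txt and latin1.txt are just made up for illustration; the octal escapes are the same byte values that od prints in the question and in the converted output further down.

```shell
# Print the character encoding of the current locale (typically UTF-8 on modern systems).
locale charmap

# Write the bytes explicitly with octal escapes, so the result does not
# depend on the terminal or locale encoding.
printf '\303\246\303\270\303\245' > utf8.txt   # æøå as UTF-8: c3 a6 c3 b8 c3 a5
printf '\346\370\345' > latin1.txt             # æøå as ISO8859-1: e6 f8 e5

# Dump both files as hex to confirm what was written.
od -An -tx1 utf8.txt
od -An -tx1 latin1.txt
```

With files like these you can feed your program a real ISO8859-1 input without changing your locale.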

To convert your file from UTF-8 to ISO8859-1 you can use

recode utf8..iso8859-1 test.txt

After this you will get

$ od -c -tx1 test.txt
0000000 346 370 345
         e6  f8  e5
0000003

As noted by R.., you might have to install recode if it is not already installed. You can also use iconv, but it cannot modify a file in place. See also "Best way to convert text files between character sets?" and https://unix.stackexchange.com/q/10241/330217
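For the iconv route, a minimal sketch could look like the following. It assumes an iconv that accepts the names UTF-8 and ISO-8859-1 (the glibc one does), and test.txt.tmp is just a temporary name picked for this example.

```shell
# Create the UTF-8 sample with explicit bytes (æøå), independent of the locale.
printf '\303\246\303\270\303\245' > test.txt

# iconv writes to stdout, so convert into a temporary file first
# and replace the original only if the conversion succeeded.
iconv -f UTF-8 -t ISO-8859-1 test.txt > test.txt.tmp && mv test.txt.tmp test.txt

# The file should now contain the three ISO8859-1 bytes e6 f8 e5.
od -An -tx1 test.txt
```

Do not redirect straight back into test.txt: the shell truncates the file before iconv gets to read it.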

edited Sep 17 '19 at 12:55

answered Sep 17 '19 at 08:16

Thanks, that is probably what I missed :-) I did not know about the recode command! – NewDev90 Sep 17 '19 at 09:08
The `iconv` command is the standard way to do this. `recode` is a random utility that might or might not be installed. – R.. GitHub STOP HELPING ICE Sep 17 '19 at 12:28

score 2 · Answer 2 · answered Sep 17 '19 at 12:36

Bodo's answer is correct, but I think the root of your problem is the ambiguity of the term "character set". You're correct that all those characters are in the set of characters available in ISO-8859-1, but that's not terribly relevant; all it means is that you can faithfully represent them when encoding your text as ISO-8859-1. The ambiguity (some might even say misuse) of the word "set" here is why, in modern usage, the concept is called "coded character set" or preferably "character encoding", to reflect that the important aspect is how abstract characters in the set of available characters map to stored representations.

As sets, ISO-8859-1 is a subset of Unicode and thus a subset of the set of characters representable by UTF-8. But as encodings they don't agree anywhere except the subset that is ASCII. All other characters present in ISO-8859-1 are represented differently in UTF-8 than in ISO-8859-1; if this weren't the case, there would be no way to represent more than 256 characters since in ISO-8859-1 the meanings of all 256 bytes are assigned (to single characters).
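A quick way to see this on the command line, as a rough sketch assuming an iconv that knows both encodings (the glibc one does): convert single ISO-8859-1 bytes to UTF-8 and look at what comes out.

```shell
# 'A' (0x41) is in the ASCII subset: the byte is the same in both encodings.
printf 'A' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1      # prints 41

# 0xe6 is æ in ISO-8859-1; in UTF-8 the same character takes two bytes.
printf '\346' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1   # prints c3 a6
```

Every byte from 80 to ff behaves like the second case and grows to two bytes in UTF-8.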

As noted in Bodo's answer, æ is encoded in UTF-8 as c3 a6, whereas in ISO-8859-1 it's encoded as e6.
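For a file-like tool, this difference is what lets you tell the two apart from the bytes alone. Below is a rough heuristic, not a description of how file itself works: guess_encoding is a hypothetical helper, and it assumes an iconv that rejects invalid UTF-8 input with a non-zero exit status (glibc's does). Nearly any byte sequence is "valid" ISO-8859-1, so that branch can only ever be a guess.

```shell
# Hypothetical helper: guess the encoding of a file from its bytes alone.
guess_encoding() {
    # Count the bytes outside the ASCII range 00-7f.
    high=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' ')
    if [ "$high" -eq 0 ]; then
        echo "ASCII"
    elif iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        # Round-tripping through iconv succeeds only for well-formed UTF-8.
        echo "UTF-8"
    else
        # High bytes that are not well-formed UTF-8: fall back to a Latin-1 guess.
        echo "ISO-8859-1 (guess)"
    fi
}

# Sample files written as explicit bytes (æøå in each encoding).
printf '\303\246\303\270\303\245' > utf8.txt
printf '\346\370\345' > latin1.txt

guess_encoding utf8.txt
guess_encoding latin1.txt
```

With glibc iconv I would expect these two calls to report UTF-8 and ISO-8859-1 (guess) respectively. In e6 f8 e5, the byte e6 announces a three-byte UTF-8 sequence, but f8 is not a valid continuation byte, so the UTF-8 check fails.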
