Convert files between UTF-8 and ISO-8859 on Linux

Question

Every time that I get confronted with Unicode, nothing works. I'm on Linux, and I got these files from Windows:

$ file *
file1: UTF-8 Unicode text
file2: ISO-8859 text
file3: ISO-8859 text

Nothing was working until I found out that the files have different encodings. I want to make my life easy and have them all in the same format:

iconv -f UTF-8 -t ISO-8859 file1 > test
iconv: conversion to `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.

I tried to convert to ISO because that's only 1 conversion + when I open those ISO files in gedit, the German letter "ü" is displayed just fine. Okay, next try:

iconv -f ISO-8859 -t UTF-8 file2 > test
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.

but obviously that didn't work.

That's because "ISO-8859" isn't an encoding. Did you maybe mean ISO-8859-1 or ISO-8859-15? Or [one of the other 14](https://en.wikipedia.org/wiki/ISO/IEC_8859#Table)? – hobbs, Aug 02 '17 at 15:38
There might also be a problem with your UTF-8 source file: it can contain characters that can't be represented in ISO-8859. Converting to UTF-8 will be much safer. – Marek Vitek, Aug 02 '17 at 15:39

armnotstrong · Accepted Answer · 2017-08-02T16:05:55.477

9

Each ISO-8859-x encoding (ISO-8859-1 is the one also known as Latin-1) contains only a very limited set of characters, so you should always try to convert to UTF-8 to make life easier.
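To illustrate the limited repertoire, here is a minimal sketch, assuming a glibc-style `iconv` that exits with a non-zero status when a character cannot be represented in the target encoding. The file names are made up for the example. The sample text contains "ü" and "€"; ISO-8859-1 has no euro sign, while ISO-8859-15 does.

```shell
# Write a UTF-8 sample containing "ü" (bytes C3 BC) and "€" (bytes E2 82 AC),
# using octal escapes so the script itself stays plain ASCII.
printf 'f\303\274r 5 \342\202\254\n' > sample_utf8.txt

# ISO-8859-1 has no euro sign, so this conversion is expected to fail.
if iconv -f UTF-8 -t ISO-8859-1 sample_utf8.txt > sample_latin1.txt 2>/dev/null; then
    latin1_status=ok
else
    latin1_status=failed
fi

# ISO-8859-15 includes the euro sign (byte 0xA4), so this one should succeed.
if iconv -f UTF-8 -t ISO-8859-15 sample_utf8.txt > sample_latin9.txt; then
    latin9_status=ok
else
    latin9_status=failed
fi

echo "ISO-8859-1: $latin1_status, ISO-8859-15: $latin9_status"
```

Going the other way (any ISO-8859-x to UTF-8) does not have this problem, which is why converting everything to UTF-8 is the safer direction.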

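The character set of Unicode, which UTF-8 encodes, is a superset of every ISO 8859 character set, so it is not surprising that a UTF-8 file may fail to convert to ISO 8859: it can contain characters that the target encoding cannot represent.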
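Note that the same character is stored as different bytes in the two encodings. A small sketch of this, assuming `iconv` and `od` are available (the file names are only for illustration):

```shell
# "ü" as a single ISO-8859-1 byte (0xFC, written here as octal 374).
printf '\374' > u_latin1.bin

# Convert to UTF-8: the character stays the same, the byte sequence changes.
iconv -f ISO-8859-1 -t UTF-8 u_latin1.bin > u_utf8.bin

# Dump both files as hex bytes for comparison.
latin1_hex=$(od -An -tx1 u_latin1.bin | tr -d ' \n')
utf8_hex=$(od -An -tx1 u_utf8.bin | tr -d ' \n')
echo "ISO-8859-1: $latin1_hex  UTF-8: $utf8_hex"
```

This is why an ISO-8859-1 file read as if it were UTF-8 comes out wrong: the lone byte `fc` is not a valid UTF-8 sequence.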

It seems the `file` command gives only very limited information about a file's encoding.

You could try guessing the source encoding: ISO-8859-1, ISO-8859-15, or one of the other ISO-8859 parts, as suggested in the comment by @hobbs.
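Trying a few candidates could look roughly like this. It is only a sketch: the sample file, its name, and the list of candidates are assumptions for the example, and you still have to look at the results yourself to see which one reads correctly.

```shell
# Hypothetical sample: "für 5 €" stored as ISO-8859-15 bytes (0xFC = "ü", 0xA4 = "€").
printf 'f\374r 5 \244\n' > guess_me.txt

# Convert with each candidate source encoding and keep every result for inspection.
for enc in ISO-8859-1 ISO-8859-15 ISO-8859-5; do
    if iconv -f "$enc" -t UTF-8 guess_me.txt > "guess_me.$enc.txt" 2>/dev/null; then
        printf '%s: ' "$enc"
        cat "guess_me.$enc.txt"
    else
        # An encoding this iconv does not support ends up here.
        echo "$enc: conversion failed"
    fi
done
```

Every 8-bit guess "works" in the sense that iconv does not complain; only the ISO-8859-15 result shows the euro sign, so the right answer has to be picked by eye.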

You can get the list of encodings that iconv supports with `iconv -l`.

If guessing the real file encoding gives you trouble, this silly script might help you out :D

edited Aug 02 '17 at 16:05

answered Aug 02 '17 at 15:47

armnotstrong


2

@user3182532 ISO 8859 is the name of a standard with 16 parts that specify 16 different encodings (with some commonalities but various differences). `file` is telling you that it doesn't *know* which one it is. This is the general problem with 8-bit encodings: it's easy enough to tell that you have an 8-bit encoding, but figuring out *which* one without prior knowledge is pure guesswork. Try them and see which one comes out right. 8859-15 is a good first guess. – hobbs Aug 03 '17 at 02:24
1

"utf-8 (Unicode) is a superset of ISO 8859" ... *I think this is not true*. Could you please cite or explain, because in my experience, a file saved as `iso-8859-1` interpreted as if it were `utf-8` will definitely come out wrong. – Stewart Jan 17 '18 at 12:42
@Stewart the set of characters encoded by Unicode **is** a superset of the set of characters in each of the ISO-8859 charsets. The characters are encoded differently, though. – Paŭlo Ebermann Jul 01 '19 at 13:06
@PaŭloEbermann @amnotstrong There is a lot of confusion going around here. UTF-8 is _not_ Unicode, UTF-8 _is_ an encoding of Unicode. The way UTF-8 encodes Unicode codepoints means that it _is_ a superset of ASCII however. All valid ASCII characters are identically coded in UTF-8, e.g. `A` is encoded as decimal `65` in both ASCII and UTF-8. It is also encoded as `65` in all ISO8859 charsets because they are _also_ supersets of ASCII. But UTF-8 is _not_ a superset of any ISO8859 charsets. It _is_ possible to convert from any ISO8859 charset to UTF-8 _because_ UTF-8 encodes all of Unicode. – Andreas Magnusson Feb 23 '22 at 12:17
I guess the word "superset" is a bit too stretched here. Unicode is a set of characters, each with a number (called a code point). UTF-8 (and the other UTFs) are encodings of the Unicode characters as bytes. ASCII and ISO-8859-x are character sets (each with different characters) and encodings of these characters into bytes. Unicode is a superset of all these character sets, but the encoding in UTF-8 is different from the encoding in ISO-8859-x. ASCII (as a character set) is a subset of ISO-8859-x, and also of Unicode, and ASCII (as an encoding) is a "subencoding" of ISO-8859-x and UTF-8. – Paŭlo Ebermann Feb 23 '22 at 16:26

score 1 · Answer 2 · answered Jun 30 '21 at 18:59

As in the other answers, you can list the supported formats:

iconv -l | grep 8859

Piping through grep saves you time finding which variants of your encoding are supported. You can search for the <number> as in my example, or for ISO, or for any string you expect in your encoding's name.
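For other search strings, a case-insensitive match is usually more forgiving, since the exact spelling and layout of the list differ between iconv implementations. A rough sketch (the patterns here are just examples):

```shell
# Capture the list once; names and layout vary between iconv implementations.
encodings=$(iconv -l)

# Collect the patterns that matched, for a summary at the end.
found=""

# -i makes the match case-insensitive; "--" guards against patterns starting with "-".
for pattern in 8859-1 utf-8; do
    if printf '%s\n' "$encodings" | grep -i -- "$pattern" > /dev/null; then
        echo "$pattern: found"
        found="$found $pattern"
    else
        echo "$pattern: not found"
    fi
done

echo "matched:$found"
```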
