9

Every time that I get confronted with Unicode, nothing works. I'm on Linux, and I got these files from Windows:

$ file *
file1: UTF-8 Unicode text
file2: ISO-8859 text
file3: ISO-8859 text

Nothing was working until I found out that the files have different encodings. I want to make my life easy and have them all in the same format:

iconv -f UTF-8 -t ISO-8859 file1 > test
iconv: conversion to `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.

I tried converting to ISO first because that would take only one conversion, and because the German letter "ü" is displayed just fine when I open those ISO files in gedit. Okay, next try:

iconv -f ISO-8859 -t UTF-8 file2 > test
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.

but obviously that didn't work.

Cody Gray - on strike
user3182532
  • That's because "ISO-8859" isn't an encoding. Did you maybe mean ISO-8859-1 or ISO-8859-15? Or [one of the other 14](https://en.wikipedia.org/wiki/ISO/IEC_8859#Table)? – hobbs Aug 02 '17 at 15:38
  • Also, there might be a problem with your UTF-8 source file. It can contain characters that can't be represented in ISO-8859. Converting to UTF-8 will be much safer. – Marek Vitek Aug 02 '17 at 15:39

2 Answers

9

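The ISO-8859-x encodings (ISO-8859-1 is the one known as Latin-1) each cover only a small set of characters, so you should prefer converting to UTF-8 to make life easier.

The Unicode character set, which UTF-8 encodes, is a superset of the characters in every ISO 8859 part, so it is not surprising that converting from UTF-8 to an ISO 8859 encoding can fail: the file may contain characters the target cannot represent.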
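
To make the difference between "same characters" and "same bytes" concrete, here is a small sketch in Python (my choice for illustration; the question itself only uses iconv). It shows that "ü" exists in both ISO-8859-1 and UTF-8 but is stored as different bytes, and that a character such as "€" has no ISO-8859-1 representation at all. The helper name `fits_in` is made up for this example.

```python
# The same character is stored as different bytes in each encoding.
text = "ü"
latin1_bytes = text.encode("iso-8859-1")  # a single byte
utf8_bytes = text.encode("utf-8")         # two bytes

print(latin1_bytes)  # b'\xfc'
print(utf8_bytes)    # b'\xc3\xbc'

# ISO-8859-1 text can always be re-encoded as UTF-8 ...
assert latin1_bytes.decode("iso-8859-1").encode("utf-8") == utf8_bytes


# ... but not every Unicode character fits into a given ISO 8859 part.
def fits_in(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(fits_in("ü", "iso-8859-1"))   # True
print(fits_in("€", "iso-8859-1"))   # False
print(fits_in("€", "iso-8859-15"))  # True
```

This is why going from an ISO 8859 encoding to UTF-8 is the safe direction, while the reverse only works if the text happens to stay within the target's character set.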

The file command gives only limited information about a file's encoding.

You will have to guess the source encoding: try ISO-8859-1 or ISO-8859-15 first, then the other ISO 8859 parts, as suggested in the comment by @hobbs.

You can list the encodings that iconv supports with iconv -l

If guessing the real file encoding proves difficult, this silly script might help you out :D
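
The "try them and see" approach can be sketched in a few lines of Python. This is not the script referred to above, only a minimal illustration: it decodes the same bytes with several candidate encodings and keeps the ones that decode without error, so you can look at the results and pick the one that reads correctly. The function name `preview_decodings` and the candidate list are assumptions; cp1252 is included only because the files came from Windows.

```python
def preview_decodings(data, candidates=("utf-8", "iso-8859-1", "iso-8859-15", "cp1252")):
    """Return {encoding: decoded text} for each candidate that decodes cleanly."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; skip it.
            pass
    return results


# Stand-in for the contents of a file read in binary mode,
# e.g. open("file2", "rb").read()
sample = "Grüße, 5 €".encode("iso-8859-15")

for encoding, text in preview_decodings(sample).items():
    print(f"{encoding:12} {text!r}")
```

Note that 8-bit encodings almost never raise an error, so several candidates will "succeed"; only reading the output tells you which one is right. Once you know the name, pass it to iconv, for example (assuming the files turn out to be ISO-8859-15): iconv -f ISO-8859-15 -t UTF-8 file2 > file2.utf8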

armnotstrong
  • @user3182532 ISO 8859 is the name of a standard with 16 parts that specify 16 different encodings (with some commonalities but various differences). `file` is telling you that it doesn't *know* which one it is. This is the general problem with 8-bit encodings: it's easy enough to tell that you have an 8-bit encoding, but figuring out *which* one without prior knowledge is pure guesswork. Try them and see which one comes out right. 8859-15 is a good first guess. – hobbs Aug 03 '17 at 02:24
  • "utf-8 (Unicode) is a superset of ISO 8859" ... *I think this is not true*. Could you please cite or explain, because in my experience, a file saved as `iso-8859-1` interpreted as if it were `utf-8` will definitely come out wrong. – Stewart Jan 17 '18 at 12:42
  • @Stewart the set of characters encoded by Unicode **is** a superset of the set of characters in each of the ISO-8859 charsets. The characters are encoded differently, though. – Paŭlo Ebermann Jul 01 '19 at 13:06
  • @PaŭloEbermann @amnotstrong There is a lot of confusion going around here. UTF-8 is _not_ Unicode, UTF-8 _is_ an encoding of Unicode. The way UTF-8 encodes Unicode codepoints means that it _is_ a superset of ASCII however. All valid ASCII characters are identically coded in UTF-8, e.g. `A` is encoded as decimal `65` in both ASCII and UTF-8. It is also encoded as `65` in all ISO8859 charsets because they are _also_ supersets of ASCII. But UTF-8 is _not_ a superset of any ISO8859 charsets. It _is_ possible to convert from any ISO8859 charset to UTF-8 _because_ UTF-8 encodes all of Unicode. – Andreas Magnusson Feb 23 '22 at 12:17
  • I guess the word "superset" is a bit too stretched here. Unicode is a set of characters, each with a number (called a code point). UTF-8 (and the other UTFs) are encodings of the Unicode characters as bytes. ASCII and ISO-8859-x are character sets (each with different characters) and encodings of these characters into bytes. Unicode is a superset of all these character sets, but the encoding in UTF-8 is different from the encoding in ISO-8859-x. ASCII (as a character set) is a subset of the ISO-8859-x sets, and also of Unicode, and ASCII (as an encoding) is a "subencoding" of ISO-8859-x and UTF-8. – Paŭlo Ebermann Feb 23 '22 at 16:26
1

As other answers mention, you can list the supported encodings:

iconv -l | grep 8859 

Piping the list through grep saves you time when checking which variants of your encoding are supported. You can search for the number, as in my example, or for ISO, or for any other string you expect in the encoding's name.

smilyface