
I have a file that geany tells me is a UTF-8 file but with characters like:

Ã¹ instead of ù

and so on. That's because the bytes 0xC3 + 0xB9 are being treated as two separate characters instead of the single character U+00F9, right? Geany already reports the file's encoding as UTF-8; if I switch to ISO-8859-1, of course, the characters aren't corrected.
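
To make the mechanics concrete, this is easy to reproduce from a terminal (a quick sketch assuming bash, a UTF-8 locale, and the standard od and iconv tools):

    # "ù" (U+00F9) is stored as the two bytes 0xC3 0xB9 in UTF-8
    printf 'ù' | od -An -tx1                            # prints: c3 b9

    # reading those two bytes as ISO-8859-1 and re-encoding them as UTF-8
    # turns the one character into the two-character sequence "Ã¹"
    printf '\xc3\xb9' | iconv -f ISO-8859-1 -t UTF-8    # prints: Ã¹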

Is there a bash command, Java class, Ruby module, or some magic potion that can fix this automatically, without me having to do it by hand?

EDIT:

If I try to switch the encoding, I can't save the file because I get errors like:

Error message: Invalid byte sequence in conversion input The error occurred at "€" (line: 1389, column: 46).

dierre

2 Answers


It sounds like Geany is interpreting the file as ISO-8859-1, which, as you say, means it's displaying two characters instead of one.

Two commands that might be helpful: od and iconv. od ("octal dump") lets you verify exactly which bytes are in the file, and iconv converts text from one encoding to another.
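
For example, something along these lines (a sketch assuming GNU od and iconv; suspect.txt and converted.txt are placeholder names):

    # dump the raw bytes, so you can see whether "ù" is stored as f9 or as c3 b9
    od -An -tx1 suspect.txt | less

    # convert the whole file from one encoding to another (writes a new file)
    iconv -f ISO-8859-1 -t UTF-8 suspect.txt > converted.txt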

chooban
  • Actually Geany sees it as UTF-8. If I launch the file command on the file I get UTF-8 Unicode text, with very long lines – dierre Jun 25 '12 at 15:09
  • That's strange. I don't suppose you can host the file somewhere so I can try it in Geany myself? – chooban Jun 26 '12 at 08:06
  • I can't because it contains sensitive data. – dierre Jun 26 '12 at 08:10
  • If I try to convert it to ISO-8859-1 I got this message: Error message: Invalid byte sequence in conversion input The error occurred at "€" (line: 1389, column: 46). – dierre Jun 26 '12 at 08:13
  • So you see, we have a UTF-8 file, but with accented characters in their two-character ISO-8859 form. It's freaking weird. – dierre Jun 26 '12 at 08:14
  • Ah, okay, that looks like a different character. It could be that someone's managed to get a mix of UTF-8 and ISO-8859-1 in there. – chooban Jun 26 '12 at 08:21
  • Exactly. The problem is that I can't see how to correct this with a script instead of doing it by hand. – dierre Jun 26 '12 at 08:25
  • Depending on the size of the file, doing it by hand might be quicker. Keep running iconv and correct each character you come across. Apart from that, a script which iterates over the file interpreting bytes and handling any it doesn't like ("Not a UTF-8 byte stream? Try again as Latin-1 and then convert to UTF-8") is the first idea off the top of my head. – chooban Jun 26 '12 at 08:35
  • Take a look at this - http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string You could modify that so that when it hits a single byte it doesn't like, it re-interprets the byte as latin-1, converts *that* to UTF-8 and puts it back in the string. Good luck! – chooban Jun 28 '12 at 17:31
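
A rough shell sketch of that retry idea, applied per line rather than per byte (only an illustration, assuming bash and iconv; broken.txt and fixed.txt are placeholder names): a line whose characters round-trip through ISO-8859-1 back to valid UTF-8 is treated as double-encoded and repaired, while any other line, for example one containing a genuine €, is left untouched.

    while IFS= read -r line; do
      # try to undo the double encoding: UTF-8 "Ã¹" -> bytes C3 B9 -> UTF-8 "ù"
      if fixed=$(printf '%s' "$line" | iconv -f UTF-8 -t ISO-8859-1 2>/dev/null) &&
         printf '%s' "$fixed" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        printf '%s\n' "$fixed"    # the line round-trips cleanly: keep the repaired version
      else
        printf '%s\n' "$line"     # genuine UTF-8 (e.g. €) or already correct: keep as is
      fi
    done < broken.txt > fixed.txt

A line that mixes both cases (a double-encoded Ã¹ and a real €) would still need hand-editing, which is where a character-by-character script like the one linked in the comments comes in.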

It seems like a bug in Geany. If you open an ANSI file (one created on Windows), Geany interprets it as ISO-8859-1. When you then try to add some Unicode symbols and save, you get

An error occurred while converting the file from UTF-8 in "ISO-8859-1".  

Try Document->Set Encoding->Unicode (UTF-8) and save the document. It helps.
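
If the file should be fixed from the command line rather than through Geany's menus, a one-liner along these lines might also work (a sketch assuming the file really is Windows ANSI, i.e. Windows-1252; ansi.txt and utf8.txt are placeholder names):

    iconv -f WINDOWS-1252 -t UTF-8 ansi.txt > utf8.txt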

Myosotis