
I have a file that geany tells me is a UTF-8 file but with characters like:

Ã¹ instead of ù

and so on. That's because the bytes 0xC3 + 0xB9 are being treated as two separate characters instead of the single character U+00F9, right? Geany already reports the file's encoding as UTF-8; if I switch to ISO-8859-1, of course, the characters aren't corrected.
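
To make the mechanics concrete, this is easy to reproduce from a terminal (a quick sketch assuming bash, a UTF-8 locale, and the standard od and iconv tools):

    # "ù" (U+00F9) is stored as the two bytes 0xC3 0xB9 in UTF-8
    printf 'ù' | od -An -tx1                            # prints: c3 b9

    # reading those two bytes as ISO-8859-1 and re-encoding them as UTF-8
    # turns the one character into the two-character sequence "Ã¹"
    printf '\xc3\xb9' | iconv -f ISO-8859-1 -t UTF-8    # prints: Ã¹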

Is there a bash command, Java class, Ruby module, or some magic potion that can fix this automatically, without me having to do it by hand?

EDIT:

If I try to switch the encoding, I can't save the file because I get errors like:

Error message: Invalid byte sequence in conversion input The error occurred at "€" (line: 1389, column: 46).

dierre

2 Answers


It sounds like Geany is interpreting the file as ISO-8859-1, which, as you say, means it's displaying two characters instead of one.

Two commands that might be helpful: od and iconv. od ("octal dump") lets you verify exactly which bytes are in the file, and iconv converts text from one encoding to another.
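
For example, something along these lines (a sketch assuming GNU od and iconv; suspect.txt and converted.txt are placeholder names):

    # dump the raw bytes, so you can see whether "ù" is stored as f9 or as c3 b9
    od -An -tx1 suspect.txt | less

    # convert the whole file from one encoding to another (writes a new file)
    iconv -f ISO-8859-1 -t UTF-8 suspect.txt > converted.txt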

chooban
  • Actually Geany sees it as UTF-8. If I launch the file command on the file I get UTF-8 Unicode text, with very long lines – dierre Jun 25 '12 at 15:09
  • That's strange. I don't suppose you can host the file somewhere so I can try it in Geany myself? – chooban Jun 26 '12 at 08:06
  • I can't because it contains sensitive data. – dierre Jun 26 '12 at 08:10
  • If I try to convert it to ISO-8859-1 I got this message: Error message: Invalid byte sequence in conversion input The error occurred at "€" (line: 1389, column: 46). – dierre Jun 26 '12 at 08:13
  • So you see, we have a UTF-8 file, but with accented characters in their two-character ISO-8859 form. It's freaking weird. – dierre Jun 26 '12 at 08:14
  • Ah, okay, that looks like a different character. It could be that someone's managed to get a mix of UTF-8 and ISO-8859-1 in there. – chooban Jun 26 '12 at 08:21
  • Exactly. The problem is that I can't see how to correct this with a script instead of doing it by hand. – dierre Jun 26 '12 at 08:25
  • Depending on the size of the file, doing it by hand might be quicker. Keep running iconv and correct each character you come across. Apart from that, a script which iterates over the file interpreting bytes and handling any it doesn't like ("Not a UTF-8 byte stream? Try again as Latin-1 and then convert to UTF-8") is the first idea off the top of my head. – chooban Jun 26 '12 at 08:35
  • Take a look at this - http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string You could modify that so that when it hits a single byte it doesn't like, it re-interprets the byte as latin-1, converts *that* to UTF-8 and puts it back in the string. Good luck! – chooban Jun 28 '12 at 17:31
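
A rough shell sketch of that retry idea, applied per line rather than per byte (only an illustration, assuming bash and iconv; broken.txt and fixed.txt are placeholder names): a line whose characters round-trip through ISO-8859-1 back to valid UTF-8 is treated as double-encoded and repaired, while any other line, for example one containing a genuine €, is left untouched.

    while IFS= read -r line; do
      # try to undo the double encoding: UTF-8 "Ã¹" -> bytes C3 B9 -> UTF-8 "ù"
      if fixed=$(printf '%s' "$line" | iconv -f UTF-8 -t ISO-8859-1 2>/dev/null) &&
         printf '%s' "$fixed" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        printf '%s\n' "$fixed"    # the line round-trips cleanly: keep the repaired version
      else
        printf '%s\n' "$line"     # genuine UTF-8 (e.g. €) or already correct: keep as is
      fi
    done < broken.txt > fixed.txt

A line that mixes both cases (a double-encoded Ã¹ and a real €) would still need hand-editing, which is where a character-by-character script like the one linked in the comments comes in.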

It seems like a bug in Geany. If you open an ANSI file (one created on Windows), Geany interprets it as ISO-8859-1. When you then try to add some Unicode symbols and save, you get

An error occurred while converting the file from UTF-8 in "ISO-8859-1".  

Try Document->Set Encoding->Unicode (UTF-8) and save the document. It helps.
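
If the file should be fixed from the command line rather than through Geany's menus, a one-liner along these lines might also work (a sketch assuming the file really is Windows ANSI, i.e. Windows-1252; ansi.txt and utf8.txt are placeholder names):

    iconv -f WINDOWS-1252 -t UTF-8 ansi.txt > utf8.txt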

Myosotis