
I have many characters that I want to convert to readable characters. I do not know what encoding they are in or where they come from (old code). How can I convert them into readable characters?

I found some of the characters (but not all) in the following list:

http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?utf8=char

My characters, in a character set unknown to me:

static String unknowncharacters[] = {"–", "’", "–", "–", "’", "ž", "–", "ž", "’", "ž", "'", "´", "é", "é", "ß", "?", "š", "–", "ł",
            "ø", "á", "ñ","ș","ë","�","ồ","à","½","í","ı","ú","�","ò","š","ó","Æ","�","Ḥ","ī","ū","�","æ"};

How can I programmatically convert these characters in Java so that all of my unknown characters become readable?

mcfly soft
  • It sounds like this is probably a case of using the wrong encoding when reading your data to start with, and is best fixed *not* by handling the badly-read version, but by reading with the right encoding to start with. Unfortunately we know nothing of your data storage at the moment... – Jon Skeet Feb 04 '14 at 15:58
  • Yes, that would be good. But I really want to convert these characters as the question says. I am not looking for better concepts or better solutions :-), because I have no chance to change that (old code). – mcfly soft Feb 04 '14 at 16:40
  • Take a look at this post, it might help you: Convert UTF-8 to ISO-8859-1 (http://stackoverflow.com/questions/655891/converting-utf-8-to-iso-8859-1-in-java-how-to-keep-it-as-single-byte) – Alvin Bunk Feb 04 '14 at 17:04
  • `"–"` - looks like UTF-8 encoded dash U+2013 (bytes `e2 80 93`) being decoded as Windows-1252. – McDowell Feb 04 '14 at 17:08
  • @user1344545: So you don't care if you've actually lost data, that you could recover by going back to the original binary data and reading it with the proper encoding? I'm not suggesting changing the old *code* - I'm suggesting changing how you're reading the data it produced. Although if you could tell us more about the old code and how it wrote the data, that would suggest better approaches to reading the data too. – Jon Skeet Feb 04 '14 at 19:37
  • Thanks to Alvin and McDowell. I guess this would help if I understood it :-). I'll try to do this in Java so the result looks like the helpful hint from Joop Eggen. – mcfly soft Feb 05 '14 at 06:16
  • @Jon Skeet: Thanks for replying. Believe me, I can't read it differently. It would go too far to explain the whole project situation. I would accept losing data, but I would like to check that technically first. So my question is how to convert the character '–' so it looks like '–' in Java. I have not managed it so far. – mcfly soft Feb 05 '14 at 06:30
  • @user1344545: I would be *really* surprised if you couldn't read it differently. But if you can't be bothered to explain your situation, I won't be able to help you. Good luck - and do reply again if you feel able to give us the relevant context. (I doubt that you need to explain your whole project. Just showing the code you're using to read the data would be a good start. We don't even know where it's stored at the moment...) – Jon Skeet Feb 05 '14 at 06:45
  • @Jon Skeet: You are right, I will lose information according to the following table: http://www.string-functions.com/encodingtable.aspx?encoding=65001&decoding=1250. I have to live with that. I will accept the answer from Joop, who helped me translate the characters. THANKS to all :-) – mcfly soft Feb 05 '14 at 13:18
  • It's entirely possible that you *don't* have to live with that, which is why I was asking for more information. I still don't see why you can't even tell us the basic information about how the data is stored and how you're reading it. But hey... – Jon Skeet Feb 05 '14 at 13:19

1 Answer


You probably got this far already: saving the strings as Windows-1252 (aka Windows Latin-1) and re-reading them as UTF-8. Doing that, I still got a partial mess:

static String unknowncharacters[] =
{"–", "", "", "", "", "", "", "", "’", "ž", "'", "´", "é", "é", "ß", "?", "", "", "ł",
 "ø", "á", "ñ", "ș", "ë", "Ŀ", "ồ", "à", "½", "í", "ı", "ú", "ſ", "ò", "š", "ó", "Æ", "Ŀ", "Ḥ", "ī", "ū", "ſ", "æ"};
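
The round trip described above (write as Windows-1252, re-read as UTF-8) might look roughly like the following in Java. This is only a sketch: the class name `MojibakeFix` and the method `fix` are made up for illustration, and it assumes the garbling really was UTF-8 bytes decoded as Windows-1252, as McDowell's comment suggests for the dash. The test strings are written as Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeFix {
    // Windows-1252 has no constant in StandardCharsets, so it is looked up by name.
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /**
     * Hypothetical helper: turns the garbled text back into bytes using
     * Windows-1252, then decodes those bytes as UTF-8.
     */
    static String fix(String garbled) {
        byte[] originalBytes = garbled.getBytes(WINDOWS_1252);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E2 U+20AC U+201C is how the UTF-8 bytes e2 80 93 display under Windows-1252.
        String garbledDash = "\u00e2\u20ac\u201c";
        // U+00C3 U+00A9 is how the UTF-8 bytes c3 a9 display under Windows-1252.
        String garbledEAcute = "\u00c3\u00a9";

        System.out.println(fix(garbledDash).equals("\u2013"));   // en dash, U+2013
        System.out.println(fix(garbledEAcute).equals("\u00e9")); // e with acute, U+00E9
    }
}
```

Note that `String.getBytes(Charset)` silently substitutes a replacement byte for any character that Windows-1252 cannot represent, so strings that were garbled through a different encoding will not come back correctly this way.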

Might these be "misspelled" characters collected from miscellaneous text sources? That would mean multiple encodings: Windows, Mac, maybe even DOS. I believe the originals include Spanish, Dutch, Czech, German, French, and Turkish, among others.

The best approach would be to make a list of candidate encodings and try each encoding on every character.
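
Trying every encoding per string could be sketched as below. The class name `EncodingGuesser`, the method `candidates`, and the list of encodings are all assumptions for illustration; the list should be replaced with whatever encodings the text sources may plausibly have used. Strict encoders and decoders are used so that combinations which cannot represent the text are skipped rather than silently producing `?` characters. The code does not decide which result is right; it only lists the possibilities for a person to review.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingGuesser {
    // Hypothetical candidate list; extend it as needed.
    static final String[] CANDIDATES = {"windows-1252", "windows-1250", "ISO-8859-1", "UTF-8"};

    /**
     * For every (wrong, right) pair of encodings, re-encodes the garbled text
     * with "wrong" and decodes the bytes with "right". Pairs that fail strictly,
     * or that leave the text unchanged, are omitted from the result.
     */
    static Map<String, String> candidates(String garbled) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String wrong : CANDIDATES) {
            if (!Charset.isSupported(wrong)) continue;
            ByteBuffer bytes;
            try {
                bytes = Charset.forName(wrong).newEncoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .encode(CharBuffer.wrap(garbled));
            } catch (CharacterCodingException e) {
                continue; // the text cannot have been produced by this encoding
            }
            for (String right : CANDIDATES) {
                if (right.equals(wrong) || !Charset.isSupported(right)) continue;
                try {
                    String text = Charset.forName(right).newDecoder()
                            .onMalformedInput(CodingErrorAction.REPORT)
                            .onUnmappableCharacter(CodingErrorAction.REPORT)
                            .decode(bytes.duplicate())
                            .toString();
                    if (!text.equals(garbled)) {
                        results.put(wrong + " -> " + right, text);
                    }
                } catch (CharacterCodingException e) {
                    // the bytes are not valid in this encoding; skip the pair
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // The garbled dash from the question, written as Unicode escapes.
        candidates("\u00e2\u20ac\u201c")
                .forEach((pair, text) -> System.out.println(pair + ": " + text));
    }
}
```

Several pairs will usually succeed for the same input, so the output still needs a human (or a dictionary check) to pick the plausible reading.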

Joop Eggen