
For example, if I know that Ä‡ should be ć, how can I find out the codepage transformation that occurred there?

It would be nice if there were an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but the tools are not important; I'll take anything that works, including Python scripts).

EDIT:

Could you please be a little more verbose? Do you know for certain what some substring should be, exactly? Or do you know just the language? Or are you just guessing? And the transformation that was applied, was it correct (i.e. valid in the other charset)? Or was it a single transformation from charset X to Y while the text was actually in Z, so it's now wrong? Or was it a series of such transformations?

Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to proper encoding.

What (I presume) happened in the problem I am trying to fix now is what is described in this answer: a UTF-8 text file got opened as an ASCII text file and then exported as CSV.
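If that is indeed what happened, reversing it would presumably be a single re-encode/re-decode round trip. A minimal sketch (assuming the file was really read as cp1252 rather than strict ASCII, since ASCII has no high bytes):

```python
# Sketch of the presumed reversal: UTF-8 bytes were mis-read as cp1252 (an assumption;
# editors that claim "ASCII" usually fall back to a single-byte codepage such as cp1252).
garbled = "Ä‡"                                    # what the CSV shows now
restored = garbled.encode("cp1252").decode("utf-8")
print(restored)                                    # -> "ć" if the guess is right
```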

Unreason
  • Could you please be a little more verbose? Do you know for certain what some substring should be, exactly? Or do you know just the language? Or are you just guessing? And the transformation that was applied, was it correct (i.e. valid in the other charset)? Or was it a single transformation from charset X to Y while the text was actually in Z, so it's now wrong? Or was it a series of such transformations? – Jan Hudec Aug 08 '11 at 08:15
  • The hard part is telling whether you have correct text now or not, which is why I asked whether you know for certain what at least part of the text should have been. It also helps if you can limit what encoding the file was originally in and how many encodings were possibly involved in the incorrect conversions. – Jan Hudec Aug 08 '11 at 11:14
  • I don't have correct text now; but I know what it should be (some of the strings are last names that I can get in correct form/encoding). – Unreason Aug 08 '11 at 11:16
  • That is, I know what parts of it should be, not what the whole text should be. – Unreason Aug 08 '11 at 11:36

2 Answers


It's extremely hard to do this in general. The main problem is that all the ASCII-based encodings (ISO-8859-*, DOS and Windows codepages) use the same range of byte values, so no particular byte or set of bytes will tell you which codepage the text is in.

There is one encoding that is easy to tell apart: if it's valid UTF-8, then it's almost certainly not ISO-8859-* nor any Windows codepage, because while all byte values are valid in those, the chance of a valid UTF-8 multi-byte sequence appearing in text encoded with them is almost zero.
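As an illustration of that check (my sketch, not part of the original answer), a strict UTF-8 decode in Python is usually enough to rule UTF-8 in or out:

```python
# Sketch: a strict UTF-8 decode either succeeds or raises, which is a strong signal.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("ć".encode("utf-8")))        # True
print(looks_like_utf8("ć".encode("iso-8859-2")))   # False: a lone 0xE6 byte is not valid UTF-8
```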

Beyond that, it depends on which further encodings may be involved. A valid sequence in Shift-JIS or Big5 is also unlikely to be valid in any other encoding, while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which interpretation gives fewer errors.
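For the cp1250 vs. iso-8859-2 case, a quick way to see exactly which byte values the two codepages map differently (again just a sketch, not from the original answer):

```python
# Sketch: list the high byte values on which cp1250 and iso-8859-2 disagree.
diffs = []
for b in range(0x80, 0x100):
    raw = bytes([b])
    try:
        as_cp1250 = raw.decode("cp1250")
    except UnicodeDecodeError:
        as_cp1250 = None              # cp1250 leaves a few positions undefined
    as_latin2 = raw.decode("iso-8859-2")
    if as_cp1250 != as_latin2:
        diffs.append((hex(b), as_cp1250, as_latin2))

print(len(diffs), "byte values decode differently")
```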

If you can limit the number of transformations that may have happened, it shouldn't be too hard to put together a Python script that tries them out, eliminates the obviously wrong ones, and uses a spell-checker to pick the most likely one. I don't know of any existing tool that would do it.
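A minimal sketch of such a script (my own illustration; the candidate list and the single-step model are assumptions): given a garbled sample and what it should have been, brute-force the decode/encode round trips:

```python
# Sketch: find which single mis-decoding step could have turned `expected` into `garbled`.
CANDIDATES = ["utf-8", "cp1250", "cp1252", "iso-8859-1", "iso-8859-2"]

def find_transformations(garbled: str, expected: str):
    hits = []
    for original in CANDIDATES:          # encoding the text was really stored in
        for read_as in CANDIDATES:       # encoding it was wrongly read as
            if original == read_as:
                continue
            try:
                damaged = expected.encode(original).decode(read_as)
            except UnicodeError:
                continue                 # this pair cannot represent the sample at all
            if damaged == garbled:
                hits.append((original, read_as))
    return hits

# Example: "ć" stored as UTF-8 but read as a Windows codepage shows up as "Ä‡".
print(find_transformations("Ä‡", "ć"))
# -> [('utf-8', 'cp1250'), ('utf-8', 'cp1252')]  (both codepages agree on these bytes)
```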

Jan Hudec
  • OK, thanks for the answer (+1), but your suggestion does not use the fact that I know what certain parts of the file should be – Unreason Aug 08 '11 at 12:19
  • @Unreason: That's what I asked in the comment to the question. So you do know *exactly* what some sequence should be? That makes the detection part easy, though you'll still have to try out the combinations. Since you can try them on a short sample only, it should be doable reasonably fast, depending on how many conversions may have happened. – Jan Hudec Aug 09 '11 at 06:25

Tools like that were quite popular a decade ago, but nowadays it's quite rare to see damaged text.

As far as I know, it can be done effectively at least for a particular language. So, if you assume the text language is Russian, you could collect statistical information about characters or small groups of characters from a large number of sample texts. E.g. in English, the "th" combination appears more often than "ht".

You could then permute the different encoding combinations and choose the one whose result has the most probable text statistics.
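A rough sketch of that idea (my illustration; the bigram table is a toy stand-in for real corpus statistics):

```python
# Sketch: score candidate decodings by how plausible their character bigrams look,
# then keep the decoding with the best score.
COMMON_BIGRAMS = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}

def score(text: str) -> int:
    text = text.lower()
    return sum(1 for i in range(len(text) - 1) if text[i:i + 2] in COMMON_BIGRAMS)

def best_decoding(raw: bytes, candidates=("utf-8", "cp1252", "cp1250", "iso-8859-2")):
    results = []
    for enc in candidates:
        try:
            results.append((score(raw.decode(enc)), enc))
        except UnicodeDecodeError:
            continue
    return max(results)          # (score, most plausible encoding)

print(score("the other anthem"), score("xq zv kkj"))   # higher score = more plausible English
```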

kan