0

I have a string that was read as UTF-8 (it is not from a file, so I can't check for a BOM). The problem is that sometimes the original text was produced in another encoding and then converted to UTF-8, so the string is unreadable gibberish.

Is it possible to detect that this string is not actual UTF-8?
Thanks!

captain dragon
  • If it comes out gibberish, it wasn't "converted" to UTF8 in any sensible way. What really happened to it? Where did it go from bytes to a string, and what went wrong? – Tim S. Aug 08 '13 at 16:01
  • It's not a dup: the string **is** UTF8, but it was converted from another encoding, so the result is gibberish. How do you detect it? – captain dragon Aug 08 '13 at 16:03
  • Example: "беÑплаÑноe ÑканиÑованиe" – captain dragon Aug 08 '13 at 16:04
  • As far as I know, you cannot determine how a string was encoded from the string itself. Can you show the code where the issue occurs? – Kai Hartmann Aug 08 '13 at 16:07

1 Answer

1

No. They're just bytes. You could try to guess, if you wanted, by trying different conversions and seeing whether the result contains valid dictionary words, etc. In a theoretical sense, though, it's impossible without knowing something about the data itself: that it never uses certain characters, or always uses certain characters, or that it contains mostly words found in a given dictionary, etc. It might look like gibberish to a person, but the computer has no way of quantifying "gibberish".
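To make the "try different conversions" idea concrete, here is a minimal sketch in Python (the thread names no language, so that choice is an assumption). It guesses that the gibberish came from UTF-8 bytes being decoded with a single-byte encoding, reverses that step, and checks whether the bytes form valid UTF-8. The function name `try_repair_mojibake` and the candidate list are hypothetical, and a successful round trip is only a hint, not proof.

```python
def try_repair_mojibake(text, wrong_encodings=("cp1252", "latin-1")):
    """Guess whether `text` is UTF-8 that was decoded with the wrong codec.

    Returns (repaired_text, encoding) for the first candidate encoding whose
    round trip succeeds and changes the text, otherwise (None, None).
    The candidate list is an assumption; extend it for your own data.
    """
    for enc in wrong_encodings:
        try:
            # Undo the suspected wrong decode, then decode as UTF-8.
            candidate = text.encode(enc).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not representable in `enc`, or the bytes are not valid UTF-8.
            continue
        if candidate != text:
            return candidate, enc
    return None, None


if __name__ == "__main__":
    # Build a garbled string the same way the bug would: UTF-8 bytes
    # misread as Latin-1.
    garbled = "бесплатно".encode("utf-8").decode("latin-1")
    print(try_repair_mojibake(garbled))
    print(try_repair_mojibake("plain ASCII text"))
```

This only covers one kind of damage. If characters were dropped or replaced along the way, as appears to have happened to the example string in the comments, the round trip fails and you are back to guessing from what you know about the data.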

Servy