
I would like to find out whether a file is encoded as Windows-1256. Is there a way to detect in Java whether a text file is Windows-1256?

kleopatra
    Usually you can only detect which encodings a text is not, from bytes that are invalid in a given encoding. For example, a block of plain ASCII text could originally have been in any number of encodings (though in that case it shouldn't matter which one it was). – Peter Lawrey Apr 16 '12 at 07:28

3 Answers


You could use the jchardet library to guess the encoding:

http://jchardet.sourceforge.net/

And have a look at this question:

Java : How to determine the correct charset encoding of a stream

SWoeste

Add an encoding header to the file. Many text editors do this:

# -*- coding: cp1256 -*-

Other than that, there is no reliable way to do this.
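If the files do carry such a header, reading it back could look roughly like the sketch below. The class name `CodingHeader`, the method `declaredEncoding` and the regular expression are illustrative assumptions, not part of any standard API. The header line is plain ASCII, so it can be read before the real encoding is known.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodingHeader {
    // Matches "coding: name" or "coding=name", as in "# -*- coding: cp1256 -*-".
    // The exact pattern is an assumption made for this sketch.
    private static final Pattern CODING = Pattern.compile("coding[:=]\\s*([-\\w.]+)");

    /** Returns the encoding name declared on the given line, if there is one. */
    public static Optional<String> declaredEncoding(String firstLine) {
        Matcher m = CODING.matcher(firstLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(declaredEncoding("# -*- coding: cp1256 -*-"));
        System.out.println(declaredEncoding("no header here"));
    }
}
```

Whether the extracted name is accepted by `Charset.forName` depends on the aliases your JDK knows, so check `Charset.isSupported(name)` before using it.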

The problem is that the cp12xx encodings aren't very different from each other. They look different on screen, but nothing in the file's data says whether 0x8a means Arabic ٹ (1256), Š (1250 and 1252), or nothing (1255).

PS: the last sentence may look wrong because of right-to-left rendering. The code "(1256)" actually comes after the Arabic character.
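To see the ambiguity for yourself, you can decode the single byte 0x8a under each code page. This is only a small sketch: `Byte8aDemo` and `decode8a` are made-up names, and it assumes your JDK ships the Windows code pages (a regular full JDK normally does).

```java
import java.nio.charset.Charset;

public class Byte8aDemo {
    /** Decodes the single byte 0x8A under the named charset. */
    public static String decode8a(String charsetName) {
        return new String(new byte[] {(byte) 0x8A}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = {"windows-1250", "windows-1252", "windows-1255", "windows-1256"};
        for (String name : names) {
            // Print the code point rather than the glyph to avoid console encoding issues.
            System.out.printf("%s -> U+%04X%n", name, (int) decode8a(name).charAt(0));
        }
    }
}
```

The same byte yields different characters depending only on the charset you pass in, which is why the bytes alone cannot settle the question.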

Aaron Digulla

Say your candidates are Windows-1256 (Arabic), UTF-8 and Windows-1252 (parts of Western Europe). Then you can collect evidence that an encoding is wrong, say for UTF-8 (invalid byte sequences) and for Windows-1252. Some Windows-1252 byte sequences will fail to decode as UTF-8 anyway:

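try {
    readInUTF8(file);                    // hypothetical helper: strict UTF-8 decode
} catch (CharacterCodingException e) {   // the bytes were not valid UTF-8
    readInWindows1256(file);             // hypothetical helper: fall back to Windows-1256
}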

(Pseudo-code)
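A runnable version of that idea might look like the following. It is a sketch, not a general detector: the class and method names are made up, and it assumes UTF-8 and Windows-1256 are the only two candidates. It relies on `CharsetDecoder` with `CodingErrorAction.REPORT`, which throws a `CharacterCodingException` on invalid input instead of silently substituting U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FallbackDecoder {
    /** Strictly decodes the bytes; throws instead of inserting replacement characters. */
    public static String decodeStrict(byte[] data, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
    }

    /** Tries UTF-8 first and falls back to Windows-1256 if the bytes are not valid UTF-8. */
    public static String decodeUtf8OrWindows1256(byte[] data) {
        try {
            return decodeStrict(data, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume Windows-1256 (the only other candidate in this sketch).
            return new String(data, Charset.forName("windows-1256"));
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(decodeUtf8OrWindows1256(data));
    }
}
```

Note that a successful fallback does not prove the file really is Windows-1256; it only means UTF-8 was ruled out.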

Joop Eggen