0

Java : How to determine the correct charset encoding of a stream

I want to determine the encoding of a particular file at runtime.

System.getProperty("file.encoding");

The code above displays the same encoding for every input file.

Rakesh Patel
  • 393
  • 2
  • 10

3 Answers

2

See Marcelo's comment: there are some libraries you can use to guess the encoding of a file, but you can never determine it for sure unless you know it beforehand. Arbitrary text files carry no "standard" information indicating which encoding was used to write them. Specific file formats may include encoding information, but only in a way that is specific to that file format.
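To illustrate why this is only ever a guess, here is a minimal sketch that uses the standard CharsetDecoder in strict mode to rule encodings out. The class and method names (StrictDecodeCheck, decodesCleanly) are made up for this example. A clean decode only means the encoding is possible, not that it is the right one: nearly any byte sequence decodes cleanly as ISO-8859-1, for instance.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class StrictDecodeCheck {

    // Returns true if the bytes decode in the given charset without any
    // malformed or unmappable sequence. True means "possible", not "correct".
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two bytes that are Cyrillic letters in windows-1251 but are not valid UTF-8.
        byte[] bytes = {(byte) 0xCF, (byte) 0xF0};
        System.out.println("UTF-8 possible:      " + decodesCleanly(bytes, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 possible: " + decodesCleanly(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

A check like this can eliminate candidates such as UTF-8, but it cannot choose between single-byte encodings, which is where the guessing libraries come in.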

pap
  • 27,064
  • 6
  • 41
  • 46
  • OK, I will increase it, but the suggested library does not detect the encoding of a Japanese Excel file. I have tried it, and it does not give any encoding type for an Excel file. – Rakesh Patel Mar 14 '12 at 11:40
0

System.getProperty("file.encoding") returns your OS default encoding, not the encoding of any particular file. You cannot reliably read the encoding out of a text file, but you can set the encoding explicitly when writing files, to make sure the right one is used.
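As a rough sketch of what setting the encoding explicitly can look like, the example below passes a Charset when writing and again when reading, so the platform default never comes into play. The class name ExplicitCharsetDemo and its helper methods are invented for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of any library.
class ExplicitCharsetDemo {

    // Writes text with an explicitly chosen charset instead of the default one.
    static void write(Path file, String text, Charset charset) throws IOException {
        Files.write(file, text.getBytes(charset));
    }

    // Reads the file back, decoding with the charset the caller names.
    static String read(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".txt");
        try {
            // Unicode escapes spell a short Russian word; they keep this source file ASCII-only.
            write(file, "\u041f\u0440\u0438\u0432\u0435\u0442", StandardCharsets.UTF_8);
            System.out.println("Default charset: " + Charset.defaultCharset());
            System.out.println("Read back:       " + read(file, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The reader still has to know that UTF-8 was used; the file itself does not record it.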

Stefan
  • 12,108
  • 5
  • 47
  • 66
0

The "file.encoding" property is the default encoding which will be applied when your text is saved to a file.

There is no standard way to recognize a text encoding if the text does not contain some encoding info (as XML files do).
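For the XML case, a small sketch using the JDK's StAX API shows how the declared encoding can be read back. XMLStreamReader.getCharacterEncodingScheme() reports the encoding named in the XML declaration, or null if there is none. The class and method names (XmlDeclaredEncoding, declaredEncoding) are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class XmlDeclaredEncoding {

    // Returns the encoding named in the XML declaration, or null if none is declared.
    static String declaredEncoding(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            return reader.getCharacterEncodingScheme();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(xml)));
    }
}
```

This only works because the XML format defines where the encoding is declared; plain text has no such place.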

My way of detecting plain text encoding is as follows:

Russian text may come in the following encodings: cp1251, dos866, unicode, koi-8. For each Russian letter there are combinations with other letters that never appear in real text. E.g. after the letter 'а' you'll never see any of "ъ, ы, ь".

For every letter I have such a set of "impossible letters after". Then I load the file content in every encoding (not necessarily the full text, a reasonable chunk of bytes is enough) and, for each decoded text, count how many impossible combinations I get. The winner is the encoding in which this number is the lowest. And, of course, I also count characters that fall outside the alphabet range as errors. Text can contain mistakes, so there may be errorCount>0 even for the right encoding, but for a reasonable chunk of text it works quite accurately: the right encoding always has the lowest errorCount.
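The approach described above could be sketched roughly as follows. This is not the author's actual code. The class name ImpossiblePairGuesser, the method names, and the Java charset names chosen for the four encodings (windows-1251, IBM866, KOI8-R, UTF-8) are assumptions. The rule table holds only the single example given above ('а' is never followed by 'ъ', 'ы' or 'ь'), so it is far too small for real use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "impossible letters after" heuristic.
class ImpossiblePairGuesser {

    // Assumed Java names for the candidate encodings listed in the answer.
    static final List<Charset> CANDIDATES = Arrays.asList(
            Charset.forName("windows-1251"),
            Charset.forName("IBM866"),
            Charset.forName("KOI8-R"),
            StandardCharsets.UTF_8);

    // Letter -> letters that never follow it. Only the one rule from the
    // answer: Cyrillic 'a' (U+0430) is never followed by U+044A, U+044B, U+044C.
    static final Map<Character, String> IMPOSSIBLE_AFTER =
            Collections.singletonMap('\u0430', "\u044a\u044b\u044c");

    // ASCII, the lowercase Russian letters U+0430..U+044F, and U+0451 are accepted.
    static boolean isExpected(char c) {
        return c < 0x80 || (c >= '\u0430' && c <= '\u044f') || c == '\u0451';
    }

    // Counts characters outside the expected alphabet plus impossible pairs.
    static int countErrors(String text) {
        int errors = 0;
        char prev = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            if (!isExpected(c)) {
                errors++;
            }
            String banned = IMPOSSIBLE_AFTER.get(prev);
            if (banned != null && banned.indexOf(c) >= 0) {
                errors++;
            }
            prev = c;
        }
        return errors;
    }

    // Decodes the sample in every candidate charset and returns the one with
    // the fewest errors. On a tie, the earlier candidate in the list wins.
    static Charset guess(byte[] sample, List<Charset> candidates) {
        Charset best = null;
        int bestErrors = Integer.MAX_VALUE;
        for (Charset cs : candidates) {
            // new String(bytes, charset) replaces undecodable input with U+FFFD,
            // which isExpected() then counts as an error.
            int errors = countErrors(new String(sample, cs));
            if (errors < bestErrors) {
                best = cs;
                bestErrors = errors;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A short Russian phrase written with Unicode escapes to keep the source ASCII-only.
        String text = "\u044f \u043b\u044e\u0431\u043b\u044e \u0442\u0435\u0431\u044f";
        byte[] sample = text.getBytes(Charset.forName("windows-1251"));
        System.out.println("Guessed: " + guess(sample, CANDIDATES));
    }
}
```

With a rule table this small, ties between single-byte encodings are common, so a usable version needs a set of impossible followers for every letter, as described above.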

Maybe you will find this useful.

yggdraa
  • 2,002
  • 20
  • 23