
I have no idea how to add multibyte encoding support, and I have very little knowledge of multibyte languages. I am working on a search engine, so my application scans source code written in all programming languages, and some of that source code may contain CJK text in its comments. For simplicity's sake, I will take Java as the sample source code; my application is also written in Java.

First, I want to write test cases that check whether the to-be-indexed source code contains CJK text and whether my application encodes it correctly. I want these tests to fail while support is missing, so that it can be added in the future.

But I have no idea how to test this, how to enter CJK text into the input samples for a unit test, or what the output would look like on the Java application console.
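To make the question concrete, here is a minimal sketch of the kind of round-trip check I have in mind (this is my own guess: it assumes UTF-8 is the expected encoding, and the class name and temp-file name are arbitrary). The CJK sample is embedded via `\u` escapes so the test source file itself stays ASCII-safe:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CjkRoundTripTest {
    public static void main(String[] args) throws Exception {
        // CJK sample ("你好世界", "Hello world") embedded as \u escapes,
        // so this .java file needs no special encoding itself
        String comment = "// \u4f60\u597d\u4e16\u754c";

        // Write the sample to a temp file in UTF-8, as indexer input would be
        Path tmp = Files.createTempFile("cjk-sample", ".java");
        Files.write(tmp, comment.getBytes(StandardCharsets.UTF_8));

        // Read it back declaring the charset explicitly; a mismatch here
        // (e.g. reading UTF-8 bytes as ISO-8859-1) is what the test should catch
        String roundTripped = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        System.out.println(comment.equals(roundTripped));

        Files.deleteIfExists(tmp);
    }
}
```

Is something along these lines the right approach, or is there a more standard way to feed CJK samples into unit tests?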

misha79
  • You'd typically want to know what encodings you'll be dealing with in advance as I don't know that there's a straightforward means (if at all) of determining the encoding of a particular file. UTF8 / Unicode are the way to go if you've a choice of input encoding, otherwise you might have to muddle through with user-selectable encodings. – Will A Apr 28 '11 at 18:49
  • All Unicode encodings are multibyte, aren’t they? Anyway, it is not possible to *detect* which encoding you have. You *must* be told in which encoding the data should be treated. – tchrist Apr 28 '11 at 19:10
  • It’s not quite clear what you’re asking here. Do you want help with reading files in a specific character encoding, with determining the character encoding of a file, with creating a file using a specific character encoding in order to test your work, or with some combination of those things? – Daniel Cassidy Apr 28 '11 at 19:11

1 Answer


The presence of a Byte Order Mark can help, but BOMs are optional. There are other heuristics for determining the encoding when a UTF encoding is used. This may be of use: Java : How to determine the correct charset encoding of a stream.
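As a rough illustration of the BOM-based part (a heuristic only, since, as noted, the BOM is optional and its absence tells you nothing; the class and method names here are made up for the example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class BomSniffer {
    // Returns the charset implied by a leading BOM, if one is present.
    // UTF-8 files frequently omit the BOM, so an empty result is inconclusive.
    public static Optional<Charset> detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty(); // no BOM: fall back to other heuristics or metadata
    }
}
```

For anything beyond BOM sniffing you would want either out-of-band knowledge of the encoding or a statistical detector, as discussed in the linked question.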

Daniel Renshaw