
I have no idea how to add multibyte encoding support, and I have very little knowledge of multibyte languages. I am working on a search engine, so my application scans source code written in all programming languages, and some of that source code may contain CJK text in its comments. For simplicity's sake, I will take Java as the sample source code; my application is also written in Java.

First, I want to write test cases that check whether the to-be-indexed source code contains CJK text and whether my application encodes it correctly. I want these tests to fail while support is missing, so that it can be added in the future.

But I have no idea how to test this, how to enter CJK text into the input samples for a unit test, or what the output would look like on the Java application console.
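To make the question concrete, here is a minimal sketch of the kind of round-trip check I have in mind (this is my own guess: it assumes UTF-8 is the expected encoding, and the class name and temp-file name are arbitrary). The CJK sample is embedded via `\u` escapes so the test source file itself stays ASCII-safe:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CjkRoundTripTest {
    public static void main(String[] args) throws Exception {
        // CJK sample ("你好世界", "Hello world") embedded as \u escapes,
        // so this .java file needs no special encoding itself
        String comment = "// \u4f60\u597d\u4e16\u754c";

        // Write the sample to a temp file in UTF-8, as indexer input would be
        Path tmp = Files.createTempFile("cjk-sample", ".java");
        Files.write(tmp, comment.getBytes(StandardCharsets.UTF_8));

        // Read it back declaring the charset explicitly; a mismatch here
        // (e.g. reading UTF-8 bytes as ISO-8859-1) is what the test should catch
        String roundTripped = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        System.out.println(comment.equals(roundTripped));

        Files.deleteIfExists(tmp);
    }
}
```

Is something along these lines the right approach, or is there a more standard way to feed CJK samples into unit tests?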

misha79
  • You'd typically want to know what encodings you'll be dealing with in advance as I don't know that there's a straightforward means (if at all) of determining the encoding of a particular file. UTF8 / Unicode are the way to go if you've a choice of input encoding, otherwise you might have to muddle through with user-selectable encodings. – Will A Apr 28 '11 at 18:49
  • All Unicode encodings are multibyte, aren’t they? Anyway, it is not possible to *detect* which encoding you have. You *must* be told in which encoding the data should be treated. – tchrist Apr 28 '11 at 19:10
  • It’s not quite clear what you’re asking here. Do you want help with reading files in a specific character encoding, with determining the character encoding of a file, with creating a file using a specific character encoding in order to test your work, or with some combination of those things? – Daniel Cassidy Apr 28 '11 at 19:11

1 Answer


The presence of a Byte Order Mark can help, but BOMs are optional. There are other heuristics for determining the encoding when a UTF encoding is used. This may be of use: Java : How to determine the correct charset encoding of a stream.
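As a rough illustration of the BOM-based part (a heuristic only, since, as noted, the BOM is optional and its absence tells you nothing; the class and method names here are made up for the example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class BomSniffer {
    // Returns the charset implied by a leading BOM, if one is present.
    // UTF-8 files frequently omit the BOM, so an empty result is inconclusive.
    public static Optional<Charset> detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty(); // no BOM: fall back to other heuristics or metadata
    }
}
```

For anything beyond BOM sniffing you would want either out-of-band knowledge of the encoding or a statistical detector, as discussed in the linked question.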

Daniel Renshaw