I have no idea how to add multibyte encoding support, and I know very little about multibyte languages. I am working on a search engine, and my application scans source code in all programming languages. Some source code may contain CJK text in its comments. For simplicity, I am using Java as the sample source code, and my application is also written in Java.
First, I want to write test cases that check whether the to-be-indexed source code contains CJK text and whether my application encodes it correctly. I want the tests to fail if that support is not included, so that it can be added in the future.
But I have no idea how to test it: how to enter CJK text in the input samples for a unit test, and what the output would look like in the Java application console.
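
For illustration, here is a minimal sketch of the kind of check I have in mind. It assumes the CJK sample is written with Unicode escapes so the test file itself stays plain ASCII, and that "supported" means the text survives an encode/decode round trip through the chosen charset. The class `CjkSampleCheck` and its helper methods are hypothetical names made up for this example; only the `java.lang` and `java.nio.charset` calls are standard JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class CjkSampleCheck {

    // Hypothetical helper: true if the text contains at least one
    // Han, Hiragana, Katakana, or Hangul code point.
    public static boolean containsCjk(String text) {
        return text.codePoints().anyMatch(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA
                || script == Character.UnicodeScript.HANGUL;
        });
    }

    // Hypothetical helper: true if encoding to bytes and decoding again
    // with the given charset gives back exactly the same string.
    public static boolean survivesRoundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return text.equals(new String(bytes, charset));
    }

    // Hypothetical helper: renders each code point as U+XXXX, so results can
    // be compared or printed without depending on the console's encoding.
    public static String toCodePointList(String text) {
        return text.codePoints()
                   .mapToObj(cp -> String.format("U+%04X", cp))
                   .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Sample Java source line whose comment holds three Han characters,
        // entered as Unicode escapes inside the string literal.
        String sample = "int x = 1; // \u65E5\u672C\u8A9E";

        System.out.println("contains CJK:       " + containsCjk(sample));
        System.out.println("UTF-8 round trip:   "
                + survivesRoundTrip(sample, StandardCharsets.UTF_8));
        System.out.println("Latin-1 round trip: "
                + survivesRoundTrip(sample, StandardCharsets.ISO_8859_1));
        System.out.println("comment code points: "
                + toCodePointList("\u65E5\u672C\u8A9E"));
    }
}
```

In a unit test, the round-trip check would be the assertion that fails when the indexing path uses a charset that cannot represent CJK text. Printing the code points instead of the raw characters avoids having the result depend on how the console is configured.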