I am investigating a mess that has been made of our language support (it is used in our IDN functionality, if that rings a bell)...
I used an SQL GUI client to quickly see the structure of our language definitions. When I run select charcodes from ourCharCodesTable where language = 'myLanguage'; I get results for some values of 'myLanguage', e.g.:
myLanguage = "ASCII":
result = "-0123456789abcdefghijklmnopqrstuvwxyz"
myLanguage = "Russian":
result = "-0123456789абвгдежзийклмнопрстуфхцчшщъьюяѐѝ"
(BTW: can already see a language mistake here, if you are a polyglot like me!)
I thought: "OK, I can work with this! Let's write a Java program and put some logic to find mistakes..."
I need my logic to receive one char at a time from the 'result' and, according to the current table context, apply my logic to flag if it should or should not be there...
However! When I get to:
myLanguage = "Belarusian":
One would think this language is rather similar to Russian, but the very format of the result, as it comes from the database, is totally different: result = "U+002D\nU+0030\nU+0030..."!
And, there's another format!
myLanguage = "Chinese":
result = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"
FWIW: charcodes column is of CLOB type.
I know U+002D is '-' and U+0030 is '0'...
My current idea is to:
1] Check whether the entire response is in 'щ' format or 'U+0449' format (whether the 'U+****'s are separated with ';', ',' or '\n', I am just going to treat them as standalone chars)
a. If it is the "easy one", just send the char on to my testing method
b. If it is the "hard one", get the hex part (0449), convert to decimal (1097) and cast to char (щ)
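To make the plan above concrete, here is a minimal sketch of how it could look in Java. The class name CharCodeDecoder and its methods decode and toText are hypothetical names made up for illustration. It assumes that a value containing at least one U+XXXX token is in the "hard" format and that everything between tokens (';', ',', '\n', the leading '#') can be ignored; whether those separators carry meaning is not something I know. It works with int code points rather than char, so values above U+FFFF would also survive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns a raw charcodes value (either format) into code points.
final class CharCodeDecoder {

    // One "U+XXXX" token with 4 to 6 hex digits; separators are simply skipped.
    private static final Pattern CODE_POINT = Pattern.compile("U\\+([0-9A-Fa-f]{4,6})");

    static List<Integer> decode(String raw) {
        List<Integer> codePoints = new ArrayList<>();
        Matcher m = CODE_POINT.matcher(raw);
        while (m.find()) {
            // "Hard" format: parse the hex part, e.g. "0449" -> 1097.
            codePoints.add(Integer.parseInt(m.group(1), 16));
        }
        if (codePoints.isEmpty()) {
            // "Easy" format (assumption: no U+ tokens at all): take the characters as they are.
            raw.codePoints().boxed().forEach(codePoints::add);
        }
        return codePoints;
    }

    // Rebuilds a plain String from code points, e.g. for logging or comparison.
    static String toText(List<Integer> codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hard = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030";
        System.out.println(toText(decode(hard)));            // prints --=M-0
        System.out.println(decode("-0\u0449"));              // prints [45, 48, 1097]
    }
}
```

Each Integer in the returned list can then be passed to the testing method one at a time; Character.toChars(codePoint) gives the char form if that method really needs char values.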
So, again, my questions are:
- What is this "U+043E;U+006F,U+004D" format?
- If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?