
I'm looking for a UTF-8 table (or tables) listing all lowercase (small) and uppercase (capital) characters in (hexa)decimal form, with a relation between the corresponding elements. So far I found:

These are nice lists, though:

  • there is no relation between the 2 tables (I could create this by means of some scripting that builds a new table)
  • the values are Unicode code point values and not UTF-8 (hexa)decimal byte values.

Any suggestions?

albert
  • Use an existing library. – Allan Wind Feb 24 '21 at 12:05
  • *"the values are Unicode values and not utf-8"*: Well, the algorithm to encode Unicode codepoints to UTF-8 is *UTF-8*… – deceze Feb 24 '21 at 12:06
  • @AllanWind Nice but which one? – albert Feb 24 '21 at 12:06
  • @deceze can you give an example? – albert Feb 24 '21 at 12:07
  • Which language? ICU, for instance, would be an option for C++. This doesn't answer your question, but might solve the real problem you are trying to address. – Allan Wind Feb 24 '21 at 12:07
  • The language I would use is C++. I already tried libraries like ICU, but they need a locale, and I cannot get this to work in any way with hexadecimal values. – albert Feb 24 '21 at 12:11
  • Well, case conversion is locale dependent. What do you mean by "I cannot get to work in any way with hexadecimal values"? Also, if you haven't, check out https://stackoverflow.com/questions/39560894/how-to-convert-unicode-characters-to-uppercase-in-c – Allan Wind Feb 24 '21 at 12:15
  • When I have a character like `unsigned char x[] = {0xce, 0x94, 0}` (i.e. 'GREEK CAPITAL LETTER DELTA') I need to get `0xce 0xb4`. Or, having `0xF0 0x9E 0xA4 0xA1` (i.e. 'ADLAM CAPITAL LETTER SHA'), I need to get `0xF0 0x9E 0xA5 0x83`, and both can occur (together with other character types) in one document. – albert Feb 24 '21 at 12:23
  • Note: one lowercase code point may map to two or more uppercase code points (and vice versa). The mapping may depend on the context (the letters nearby), and it depends on the language. So use a library, or be ready to read a lot of documentation on all the complexities of the task. – Giacomo Catenazzi Feb 24 '21 at 12:24
  • @GiacomoCatenazzi I know it is, unfortunately, a very complex business, but that is inherent to natural languages. For the mapping, are you referring e.g. to LATIN CAPITAL LETTER DZ WITH CARON and friends (0xc7 0x84; 0xc7 0x85; 0xc7 0x86)? – albert Feb 24 '21 at 12:30
  • In any case, the UCD has such a list. For the simple cases, check the last fields of `UnicodeData.txt`. Then there is an auxiliary file for the special cases. See https://www.unicode.org/Public/UCD/latest/ucd/ and http://www.unicode.org/reports/tr44/ (for the field descriptions). – Giacomo Catenazzi Feb 24 '21 at 12:43
  • @GiacomoCatenazzi That is a very good source file. I also saw the file https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt at that place. Now I just have to go from the Unicode value to the UTF-8 hexadecimal value. – albert Feb 24 '21 at 12:51
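
The last step mentioned in the comments, going from a Unicode code point to its UTF-8 bytes, could be sketched roughly as below. This is only an illustration and not code from the thread: the names `encode_utf8` and `parse_simple_lowercase` are hypothetical, and the parser assumes the `UnicodeData.txt` layout described in UAX #44 (semicolon-separated fields, field 0 the code point, field 13 the simple lowercase mapping). It covers the simple one-to-one mappings only, not the special cases kept in the auxiliary file.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                              // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                      // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                    // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                  // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Parse one line of UnicodeData.txt (assumed layout, see above).
// Returns true and fills cp/lower if the line has a simple lowercase mapping.
bool parse_simple_lowercase(const std::string& line,
                            std::uint32_t& cp, std::uint32_t& lower) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ';')) fields.push_back(field);
    if (fields.size() < 14 || fields[0].empty() || fields[13].empty()) return false;
    cp = static_cast<std::uint32_t>(std::stoul(fields[0], nullptr, 16));
    lower = static_cast<std::uint32_t>(std::stoul(fields[13], nullptr, 16));
    return true;
}
```

Feeding each line of the data file through `parse_simple_lowercase` and then both code points through `encode_utf8` would give the related byte sequences asked for in the question, for example `0xCE 0x94` and `0xCE 0xB4` for the Greek delta pair.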

1 Answer


My advice is to use an existing library that can do case conversion for you. You mentioned you were targeting C++, so the ICU library is one option.

Allan Wind
  • See my comment under the question about ICU, and this comment: When I have a character like `unsigned char x[] = {0xce, 0x94, 0}` (i.e. 'GREEK CAPITAL LETTER DELTA') I need to get `0xce 0xb4`. Or, having `0xF0 0x9E 0xA4 0xA1` (i.e. 'ADLAM CAPITAL LETTER SHA'), I need to get `0xF0 0x9E 0xA5 0x83`, and both can occur (together with other character types) in one document. – albert Feb 24 '21 at 12:30
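
For illustration, the byte-level conversion described in this comment can be sketched without any library. This is not ICU and not a complete solution: the function names are hypothetical, the table holds only the two pairs from the comment, the input is assumed to be well-formed UTF-8, and locale- or context-dependent mappings are ignored. A real table would have to be filled from a source such as `UnicodeData.txt`.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Append the UTF-8 encoding of a code point (assumed to be a valid scalar value).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode the code point starting at s[i] and advance i past it.
// Minimal decoder: no validation, assumes well-formed UTF-8.
std::uint32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    int extra = 0;
    std::uint32_t cp = b0;
    if (b0 >= 0xF0)      { extra = 3; cp = b0 & 0x07; }
    else if (b0 >= 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if (b0 >= 0xC0) { extra = 1; cp = b0 & 0x1F; }
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Lowercase a UTF-8 string using a simple one-to-one code point table.
std::string to_lower_utf8(const std::string& in,
                          const std::map<std::uint32_t, std::uint32_t>& table) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::uint32_t cp = next_code_point(in, i);
        const auto it = table.find(cp);
        append_utf8(out, it == table.end() ? cp : it->second);
    }
    return out;
}

// Hypothetical two-entry table: only the pairs mentioned in the comment.
const std::map<std::uint32_t, std::uint32_t> kDemoLowercase = {
    {0x0394, 0x03B4},    // GREEK CAPITAL LETTER DELTA -> small delta
    {0x1E921, 0x1E943},  // ADLAM CAPITAL LETTER SHA -> small sha
};
```

Working on code points and re-encoding at the end keeps the table independent of the byte lengths involved, so two-byte and four-byte characters can be mixed in one document.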