
I need to check that Strings in Java contain only allowed characters, and I need to do this for several different languages. Each language adds its own set of special characters to the basic English alphabet, e.g. German has ü, ö, etc. I would like to define Java constants containing these special characters for each language, written as Unicode escape sequences so that I can keep my Java source file ASCII-encoded.

Is there a way to obtain such constants, e.g. download them somewhere, get them from a library, or generate them? I know I can look the characters up on the net, find the escape sequence for each character, and put it into my source file. Is there a way to do this with less effort?

TomFT
  • Something similar was discussed [here](https://stackoverflow.com/questions/17575840/better-way-to-generate-array-of-all-letters-in-the-alphabet) and [here](https://stackoverflow.com/questions/61600701/how-to-get-all-national-characters-for-selected-locale) - perhaps those notes can give you some ideas. – andrewJames May 13 '20 at 18:07
  • You can probably do something with Unicode script properties in a regular expression. There's a list of these at http://m.a.gg/manual/de/regexp.reference.unicode.php - quite a lot of scripts for you to choose from. You can just use them directly in a Java regular expression. – Dawood ibn Kareem May 13 '20 at 18:31
  • Thanks for the ideas, they are interesting. The main difference from what I need is that these ideas would help me define the characters of a SCRIPT. What I need is the characters of a LANGUAGE, which is different. E.g. both English and Spanish use LATIN, but Spanish has some additional characters. For now I am probably fine staying within LATIN, but I need to differentiate among the various languages. – TomFT May 13 '20 at 20:55
  • Sorry, I understand now. I was thinking more in terms of (say) English vs Russian vs Arabic. I'm unaware of any resource that tells you that German has ö and ü; French has â, é and ç; and Croatian has ć, č and đ. My gut tells me you'd want to be very careful with this idea though. Even if I'm writing English, I might want to write piñata or café. – Dawood ibn Kareem May 14 '20 at 00:42
  • Yes, exactly, I agree with you. I know there are exceptions, so this may not be possible to do in full generality. But I don't care about these exceptions, they are pretty rare; I only need the typical national letters like the ones you mentioned in your comment. Since nothing like that probably exists, I will need to do it manually. – TomFT May 17 '20 at 15:46
  • Also, I changed the title of the question to be less misleading. This is mainly about languages written in the Latin script (the question is more general than that, but I added the word LATIN to the title so that it is easier to understand). – TomFT May 17 '20 at 15:49
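
The distinction drawn in the comments between a script-level check and a language-level check can be sketched roughly as follows. This is only an illustration: the class name AllowedChars and the constant GERMAN_EXTRAS are hypothetical, and the list of German letters is typed in by hand (an assumption about which characters count as "German"), not taken from any library.

```java
import java.util.regex.Pattern;

class AllowedChars {
    // Hypothetical hand-written constant: German letters beyond A-Z, given as
    // Unicode escapes so the source stays ASCII (a, o, u with diaeresis in
    // lower and upper case, plus sharp s).
    static final String GERMAN_EXTRAS = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df";

    // Script-level check: every character must be a Latin-script character.
    static final Pattern LATIN_ONLY = Pattern.compile("\\p{IsLatin}*");

    // Language-level check: only A-Z, a-z and the German extras are allowed.
    static final Pattern GERMAN_ONLY =
            Pattern.compile("[A-Za-z" + GERMAN_EXTRAS + "]*");

    public static void main(String[] args) {
        String pinata = "pi\u00f1ata";        // contains n with tilde
        String gruesse = "Gr\u00fc\u00dfe";   // contains u with diaeresis and sharp s

        System.out.println(LATIN_ONLY.matcher(pinata).matches());   // true
        System.out.println(GERMAN_ONLY.matcher(pinata).matches());  // false
        System.out.println(GERMAN_ONLY.matcher(gruesse).matches()); // true
    }
}
```

The script property accepts any Latin letter, which is why it cannot tell Spanish from German; the per-language constant is what narrows the check. Note that neither pattern accepts spaces, digits, or punctuation, so a real validator would probably need to allow those separately.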

1 Answer


Each Unicode character has a canonical name, and it is possible to look characters up by name (Character.codePointOf is available since Java 9):

StringBuilder sb = new StringBuilder();
sb.appendCodePoint(Character.codePointOf("LATIN SMALL LETTER O WITH DIAERESIS"));
sb.appendCodePoint(Character.codePointOf("LATIN SMALL LETTER U WITH DIAERESIS"));
System.out.println(sb);

Output:

öü

Beware casting the return value of codePointOf to a char:

char c = (char) Character.codePointOf("LATIN SMALL LETTER U WITH DIAERESIS");

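This only works for characters in the Basic Multilingual Plane, i.e. those that fit in a single UTF-16 code unit (code points below U+10000, or 65,536). For a code point above that range, the cast silently keeps only the low 16 bits and yields the wrong character.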
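
Since the goal in the question is an ASCII-only source file, one possible way to combine the name lookup with escape generation is sketched below. The class EscapeGenerator and its methods toJavaEscapes and fromNames are hypothetical helpers written for this illustration, not part of the JDK. The sketch emits one \uXXXX escape per UTF-16 code unit, so a character outside the Basic Multilingual Plane comes out as a surrogate pair rather than being truncated by a cast.

```java
import java.util.List;

class EscapeGenerator {
    // Converts text to Java Unicode escapes, one escape per UTF-16 code unit,
    // so supplementary characters are written as a surrogate pair.
    static String toJavaEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (char unit : text.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    // Builds a string from Unicode character names.
    // Character.codePointOf requires Java 9 or later.
    static String fromNames(List<String> names) {
        StringBuilder sb = new StringBuilder();
        for (String name : names) {
            sb.appendCodePoint(Character.codePointOf(name));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String german = fromNames(List.of(
                "LATIN SMALL LETTER O WITH DIAERESIS",
                "LATIN SMALL LETTER U WITH DIAERESIS"));
        // Prints a constant declaration that can be pasted into an ASCII source file.
        System.out.println("static final String GERMAN = \"" + toJavaEscapes(german) + "\";");
    }
}
```

Run as is, this prints `static final String GERMAN = "\u00f6\u00fc";`. The list of names still has to be supplied per language by hand; the sketch only removes the step of looking up each escape sequence.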

David Conrad
  • Yes, thanks, but my question was not how to encode the characters in Java but where to get a list of such Java constants for various languages (so that I don't have to create them manually). – TomFT May 13 '20 at 20:37
  • @TomFT I don't know that there are any lists of characters for *languages*, as opposed to characters for *scripts*. – David Conrad May 13 '20 at 22:42