How to mimic transliterate in Java?

Question

In Perl, I usually use the transliteration operator (y/// or tr///) to count the characters in a string that belong to a given set. For example:

$c1=($a =~ y[\x{0410}-\x{042F}\x{0430}-\x{044F}]
            [\x{0410}-\x{042F}\x{0430}-\x{044F}]);

counts the Cyrillic characters in $a. That example has two classes (or two ranges, if you prefer); I have others with more classes:

$c4=($a =~ y[\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}\x{3130}-\x{318F}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}]
            [\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}\x{3130}-\x{318F}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}]);

Now I need to do the same thing in Java. Is there a similar construct in Java? Or do I need to iterate over all the characters and check whether each one falls within the limits of each class?

Thank you

score 1 · Accepted Answer · edited Jul 31 '14 at 13:44

I haven't seen anything like tr/// in Java.

You could use something like this to count all the matches, though:

Pattern p = Pattern.compile("[\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]", 
                            Pattern.CANON_EQ);
Matcher m = p.matcher(string);
int count = 0;
while (m.find())
    count++;
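
If the same counts are needed for several classes, the loop above could be wrapped in a small helper so each pattern is compiled only once. This is just a sketch: the class name RegexCharCounter and the method countMatches are made up for illustration, and the ranges are copied from the question. Pattern.CANON_EQ is left out here because the classes only match single code points.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCharCounter {
    // Hypothetical constants; the ranges are the ones listed in the question.
    static final Pattern CYRILLIC =
        Pattern.compile("[\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]");
    static final Pattern HANGUL = Pattern.compile(
        "[\\x{AC00}-\\x{D7AF}\\x{1100}-\\x{11FF}\\x{3130}-\\x{318F}"
        + "\\x{A960}-\\x{A97F}\\x{D7B0}-\\x{D7FF}]");

    // Counts how many times the pattern matches in the text.
    static int countMatches(Pattern p, CharSequence text) {
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Privet, world" with the first word written in Cyrillic letters
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442, world";
        // Two Hangul syllables followed by Latin text
        String korean = "\uD55C\uAE00 abc";
        System.out.println(countMatches(CYRILLIC, russian)); // prints 6
        System.out.println(countMatches(HANGUL, korean));    // prints 2
    }
}
```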

answered Jul 31 '14 at 13:35 by Qtax · edited Jul 31 '14 at 13:44 by Alberto

This will probably be slower than a chain of ifs checking each character, but it would help keep the Java and Perl code parallel. Thank you. – Alberto Jul 31 '14 at 13:37
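
For comparison, the "chain of ifs" mentioned in the comment above might look roughly like the following. It is only a sketch under assumptions: RangeCharCounter, CYRILLIC_RANGES and countInRanges are illustrative names, the ranges are the ones from the question, and no timing has been measured here.

```java
public class RangeCharCounter {
    // Inclusive {low, high} pairs copied from the question's Cyrillic class.
    static final int[][] CYRILLIC_RANGES = {
        {0x0410, 0x042F},
        {0x0430, 0x044F},
    };

    // Counts the code points of s that fall inside any of the given ranges.
    static int countInRanges(String s, int[][] ranges) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs as one unit
            for (int[] r : ranges) {
                if (cp >= r[0] && cp <= r[1]) {
                    count++;
                    break; // a code point is counted at most once
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "\u041F\u0440\u0438\u0432\u0435\u0442, world";
        System.out.println(countInRanges(text, CYRILLIC_RANGES)); // prints 6
    }
}
```

Keeping the ranges in a table rather than in hard-coded comparisons means the Perl classes can be transcribed almost one to one.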

score 1 · Answer 2 · edited May 23 '17 at 12:22

You can try something like this:

s.replaceAll( "[^\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]*([\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}])?", "$1" ).length()

The idea was borrowed from here: Simple way to count character occurrences in a string

answered Jul 31 '14 at 13:41 by Oleg Gryb · edited May 23 '17 at 12:22 by Community

Yes, it makes sense. It is probably slower, as it constructs a new string, but only testing will tell. – Alberto Jul 31 '14 at 17:00

score 1 · Answer 3 · answered Jul 31 '14 at 14:07

For completeness: using Java's built-in Unicode support.

// requires: import java.lang.Character.UnicodeScript;
int countCyrillic(String s) {
    int n = 0;
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        i += Character.charCount(codePoint);
        if (UnicodeScript.of(codePoint) == UnicodeScript.CYRILLIC) {
            ++n;
        }
    }
    return n;
}

This handles the full Unicode range, where two 16-bit chars may together represent a single Unicode "code point". In Java, the class Character.UnicodeScript already has everything you need.

Or:

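int n = s.replaceAll("\\P{IsCyrillic}", "").length();

Here \\P{IsCyrillic} is the negation of \\p{IsCyrillic}, the Cyrillic script, so everything that is not Cyrillic is removed before taking the length. (Java's regex syntax expects the Is prefix, or script=Cyrillic, for script names.)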
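
The same idea can be generalized to any script, for instance the Hangul case from the question. The following is a sketch for Java 8 or later; ScriptCounter and countScript are hypothetical names. Note that script membership is not the same thing as the hand-written ranges in the question: for example, Ё (U+0401) belongs to the Cyrillic script but lies outside U+0410..U+044F, so the counts can differ.

```java
import java.lang.Character.UnicodeScript;

public class ScriptCounter {
    // Counts the code points of s whose Unicode script equals the given one.
    static long countScript(String s, UnicodeScript script) {
        return s.codePoints()
                .filter(cp -> UnicodeScript.of(cp) == script)
                .count();
    }

    public static void main(String[] args) {
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442, world";
        String korean = "\uD55C\uAE00 abc";
        System.out.println(countScript(russian, UnicodeScript.CYRILLIC)); // prints 6
        System.out.println(countScript(korean, UnicodeScript.HANGUL));    // prints 2
    }
}
```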
