In Perl, I usually use the transliteration operator (tr/// or y///) to count the number of characters in a string that match a set of possible characters. For example:

$c1=($a =~ y[\x{0410}-\x{042F}\x{0430}-\x{044F}]
            [\x{0410}-\x{042F}\x{0430}-\x{044F}]);

would count the number of Cyrillic characters in $a. That example has two classes (or two ranges, if you prefer); I have others with more classes:

$c4=($a =~ y[\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}\x{3130}-\x{318F}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}]
            [\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}\x{3130}-\x{318F}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}]);

Now I need to do the same thing in Java. Is there a similar construct in Java? Or do I need to iterate over all the characters and check whether each one falls within the limits of each class?

Thank you

ikegami
Alberto

3 Answers

I haven't seen anything like tr/// in Java.

You could use something like this to count all the matches, though:

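import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Character class with the same two ranges as the Perl example.
Pattern p = Pattern.compile("[\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]",
                            Pattern.CANON_EQ);
Matcher m = p.matcher(string);  // string is the text to scan
int count = 0;
while (m.find()) {
    count++;
}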
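
To keep the Java code parallel to the Perl counters, the same loop could be wrapped in a small helper with one precompiled Pattern per character set. This is only a sketch: the class name PatternCounter, the method countMatches and the constant names are made up for illustration, and the ranges are copied from the question. It assumes Java 7 or later for the \x{...} escape.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternCounter {
    // Ranges copied from the question's Perl y/// examples.
    public static final Pattern CYRILLIC =
            Pattern.compile("[\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]");
    public static final Pattern HANGUL = Pattern.compile(
            "[\\x{AC00}-\\x{D7AF}\\x{1100}-\\x{11FF}\\x{3130}-\\x{318F}"
            + "\\x{A960}-\\x{A97F}\\x{D7B0}-\\x{D7FF}]");

    // Hypothetical helper: counts how many times p matches in text.
    public static int countMatches(Pattern p, CharSequence text) {
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, a space, two Hangul syllables.
        String text = "\u041F\u0440\u0438\u0432\u0435\u0442 \uD55C\uAE00";
        System.out.println(countMatches(CYRILLIC, text));
        System.out.println(countMatches(HANGUL, text));
    }
}
```

Compiling each pattern once and reusing it avoids recompiling the regex for every string.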
Alberto
Qtax
  • This will be (probably) slower than a chain of if's checking each character. But would help to maintain parallel code in Java and Perl. Thank you. – Alberto Jul 31 '14 at 13:37
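
The chain of range checks mentioned in the comment above could look roughly like this. It is a minimal sketch, not a benchmarked solution: the class name RangeCounter and the helper countInRanges are hypothetical, and the range table is copied from the question's Cyrillic example.

```java
public class RangeCounter {
    // Inclusive code point ranges from the question's Cyrillic example.
    public static final int[][] CYRILLIC = { {0x0410, 0x042F}, {0x0430, 0x044F} };

    // Hypothetical helper: counts the code points of s that fall in any range.
    public static int countInRanges(String s, int[][] ranges) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);  // step over surrogate pairs as one unit
            for (int[] r : ranges) {
                if (cp >= r[0] && cp <= r[1]) {
                    count++;
                    break;  // count each code point at most once
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Six Cyrillic letters followed by ASCII text.
        System.out.println(countInRanges("\u041F\u0440\u0438\u0432\u0435\u0442, world", CYRILLIC));
    }
}
```

Each Perl counter then maps to one range table, which keeps the two code bases easy to compare.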
You can try to play with something like this:

s.replaceAll("[^\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]*([\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}])?", "$1").length()

The idea was borrowed from here: Simple way to count character occurrences in a string

Community
Oleg Gryb
For completeness, here is a version that uses Java's built-in Unicode support.

import java.lang.Character.UnicodeScript;

int countCyrillic(String s) {
    int n = 0;
    for (int i = 0; i < s.length(); ) {
        // Read a full code point, which may span two chars.
        int codePoint = s.codePointAt(i);
        i += Character.charCount(codePoint);
        if (UnicodeScript.of(codePoint) == UnicodeScript.CYRILLIC) {
            ++n;
        }
    }
    return n;
}

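This handles the full Unicode range, where two 16-bit chars may represent a single Unicode "code point". In Java, the class Character.UnicodeScript already has everything you need. Note that the Cyrillic script covers more than the two ranges in the question (for example Ё, U+0401).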

Or:

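int n = s.replaceAll("\\P{IsCyrillic}", "").length();

Here \\P{IsCyrillic} is the negation of \\p{IsCyrillic}, which matches any character in the Cyrillic script. Java's regex syntax expects script names to carry the Is prefix (or the form script=Cyrillic).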
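
As a further variant, assuming Java 8 or later, the code point loop can be written with a stream. This is just a sketch: the class name ScriptCounter and the helper countScript are made up for illustration. Passing the script as a parameter lets the same method serve the Hangul case from the question, with the caveat that script membership is not identical to the hand-written ranges.

```java
import java.lang.Character.UnicodeScript;

public class ScriptCounter {
    // Hypothetical helper: counts the code points of s that belong to the given script.
    public static long countScript(String s, UnicodeScript script) {
        return s.codePoints()
                .filter(cp -> UnicodeScript.of(cp) == script)
                .count();
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, a space, two Hangul syllables.
        String text = "\u041F\u0440\u0438\u0432\u0435\u0442 \uD55C\uAE00";
        System.out.println(countScript(text, UnicodeScript.CYRILLIC));
        System.out.println(countScript(text, UnicodeScript.HANGUL));
    }
}
```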

Joop Eggen