3

I have a function to check if a string(most of the string is only with one CJK char) is only with word chars, and it will be invoked many many times, so the cost is unacceptable, but I don't know how to optimize it, any suggestions?

/*\w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
 For more details see Unicode TR-18, and bear in mind that the set of characters
 in each class can vary between Unicode releases.*/
private static final Pattern sOnlyWordChars = Pattern.compile("\\w+");

private boolean isOnlyWordChars(String s) {
    return sOnlyWordChars.matcher(s).matches();
}

when s is "3g", or "go_url", or "hao123", isOnlyWordChars(s) should return true.

iclinux
  • 492
  • 2
  • 17

4 Answers4

4
private boolean isOnlyWordChars(String s) {
    char[] chars = s.toCharArray();    
    for (char c : chars) {
        if(!Character.isLetter(c)) {
            return false;
        }
    }    
    return true;
}

A better implementation

public static boolean isAlpha(String str) {
    if (str == null) {
        return false;
    }
    int sz = str.length();
    for (int i = 0; i < sz; i++) {
        if (Character.isLetter(str.charAt(i)) == false) {
            return false;
        }
    }
    return true;
}

Or if you are using Apache Commons, StringUtils.isAlpha(). the second implemenation of the answer is actually from the source code if isAlpha.

UPDATE

HI Sorry for the late reply. I wasn't pretty sure about the speed although I read in several places that loop is faster than regex. To be sure I run the following codes in ideoone and here is the result

for 5000000 iteration

with your codes: 4.99 seconds (runtime error after that so for big data it is not working)

with my first code 2.71 seconds

with my second code 2.52 seconds

for 500000 iteration

with your codes: 1.07 seconds

with my first code 0.36 seconds

with my second code 0.33 seconds

Here is the sample code I used.

N.B. There might be small mistakes. You can play with it to test in different scenario. according to the comment of Jan, I think those are minor things like using private or public. yest condition checking is a nice point.

stinepike
  • 54,068
  • 14
  • 92
  • 112
  • 4
    multiple return and == false is better implementation? – nachokk Jun 21 '13 at 03:15
  • 1
    @nachokk depends on what your common case is; there's a good chance the first `if` won't even get compiled. – Mike Strobel Jun 21 '13 at 03:47
  • I updated my questions, "\w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]." e.g. when the string is "3g", or "go_url", or "hao123", isOnlyWordChars(s) should return true. – iclinux Jun 21 '13 at 04:29
  • Why is the first one `private` and the second one `public static`? – John Dvorak Jun 21 '13 at 04:41
  • Also, Nacokk's point was that you should use `!condition` instead of `condition == false` – John Dvorak Jun 21 '13 at 04:43
  • @StinePike, this implementation doesn't solve my problem, do you have a better solution? – iclinux Jun 22 '13 at 05:35
  • @iclinux .. are you facing problem in using isalpha method? – stinepike Jun 22 '13 at 05:39
  • @StinePike, yes, when str is "3_g", or "go_url", or "hao_123", isOnlyWordChars(str) should all return true, but yours won't...because Character.isLetter() don't equal "\w"(which is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].) – iclinux Jun 22 '13 at 05:51
  • @StinePike yes, is there a way without regx to test if a char is or isnot a "\w"? – iclinux Jun 22 '13 at 05:56
  • Character.isLetter(str.charAt(i)) in place of this condition you can create your own condition then .. including what you want and excluding what you dont want.. so crate your own isLetter method – stinepike Jun 22 '13 at 05:57
  • the problem is that, I don't know how to write my own isLetter(char), because the char maybe CJK or in other languages... – iclinux Jun 22 '13 at 06:03
1

The only thing i see is to change your pattern to:

^\\w++$

but i am not a java expert

explanations:

I have added anchors (ie ^ $) that increases the performances of the pattern (the regex engine fails at the first non word character until it encounters the end). I have added a possessive quantifier (ie ++), then the regex engine doesn't matter of backtrack positions and is more fast.

more informations here.

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
1

I think that the chief problem is your pattern.

I was working through an iterative solution, when I noticed that it failed on one of my test strings Supercalifragilisticexpalidociou5. This reason for this: \w+ only cares if there is one or more word characters. It doesn't care if you're not looking at a word character beyond what it's already matched.

To rectify this, use a lookaround:

(?!\W+)(\w+)

The \W+ condition will lock the regex if one or more characters are found to be a non-word character (such as &*()!@!#$).

Makoto
  • 104,088
  • 27
  • 192
  • 230
  • That look-ahead will never filter anything. A sequence of non-word-chars never begins where a sequence of word-chars does. Perhaps you wanted the lookahead to be `(?!\w*\W+)` and also to add string anchors? – John Dvorak Jun 21 '13 at 04:39
  • The explanation is correct, but the rectification is insufficient. – John Dvorak Jun 21 '13 at 04:42
1

If you want to do this using regexes, then the most efficient way do it is to change the logic to a negation; i.e. "every character is a letter" becomes "no character is a non-letter".

private static final Pattern pat = Pattern.compile("\\W");

private boolean isOnlyWordChars(String s) {
    return !pat.matcher(s).find();
}

This will test each character at most once ... with no backtracking.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • thanks, but It doesn't works in my case, because in my case, nearly all the "s" is only with one cjk char... old method costs [61032]ns, yours costs [58115]ns in average. – iclinux Jun 21 '13 at 07:13
  • is there's a way to check one char is "\w" without regx? because most of the string is with length 1. – iclinux Jun 22 '13 at 05:37
  • Yea, sure there is. Use a loop like StinePike's answer. – Stephen C Jun 22 '13 at 05:48
  • no, I can't, because Character.isLetter() don't equal "\w"(which is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].) – iclinux Jun 22 '13 at 05:54