public static int getWordCount(String sentence) {
    return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
         + sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}

My intention is to count the number of words in a sentence. The input to this function is a lengthy sentence. It may have 255 words.

  1. A word may contain hyphens or underscores between its characters.
  2. The function should count only valid words; runs of special characters, e.g. &&&& or ####, should not be counted as words.

The above regular expression works fine, except when a hyphen or underscore appears inside a word, e.g. co-operation: the count comes back as 2, but it should be 1. Can anyone please help?

Willem Van Onsem
neena
  • Wow, that is one of the worst concatenations I've ever seen so far. – Murat Karagöz Jun 16 '15 at 11:31
  • What's wrong with the meta symbol \w? – Sharon Ben Asher Jun 16 '15 at 11:31
  • Where did you get this regex? – Pshemo Jun 16 '15 at 11:33
  • @sharonbn `\w` doesn't match the hyphen, but of course `[-\w]` does. Or, in the case above, `(\w+(-?\w+)*)` or similar. – dhke Jun 16 '15 at 11:35
  • The split is a quite expensive operation. – Willem Van Onsem Jun 16 '15 at 11:37
  • I am new to regular expressions; I wrote this myself. If there are any mistakes, please correct me. This expression works for all words and does not count special characters; my problem is with hyphens and underscores. – neena Jun 16 '15 at 11:37
  • It is not a mistake, more like bad style. `[[a-z][A-Z][0-9][\\W][-][_]]` is the same as `[a-zA-Z0-9\\W\\-_]`, which is easier to read (and makes it clear that this regex matches every character, because `a-zA-Z0-9_` combined with `\W` covers everything, which is probably not what you want). Also, `([-][_])*` is the same as `(-_)*`. – Pshemo Jun 16 '15 at 11:41

4 Answers

Instead of using .split and .replaceAll, which are quite expensive operations, consider counting the matches directly, which needs only constant extra memory.

Based on your specifications, you seem to look for the following regex:

[\w-]+

Next you can use this approach to count the number of matches:

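import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static int getWordCount(String sentence) {
    // A word is a run of letters, digits, underscores, or hyphens.
    Pattern pattern = Pattern.compile("[\\w-]+");
    Matcher matcher = pattern.matcher(sentence);
    int count = 0;
    // Each successful find() advances to the next word.
    while (matcher.find()) {
        count++;
    }
    return count;
}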

This approach needs (roughly) constant extra memory. Splitting, by contrast, constructs an array of substrings that is basically useless here, since you never inspect its contents.

If you don't want words to start or end with hyphens, you can use the following regex:

\w+([-]\w+)*
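
To see how the two patterns differ, here is a small sketch. The class name `HyphenEdgeDemo`, the helper `count`, and the sample sentence are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample input are illustrative only.
final class HyphenEdgeDemo {
    // Loose pattern: any run of word characters and hyphens.
    static final Pattern LOOSE = Pattern.compile("[\\w-]+");
    // Strict pattern: hyphens are allowed only between runs of word characters.
    static final Pattern STRICT = Pattern.compile("\\w+(-\\w+)*");

    // Counts non-overlapping matches of the pattern in the sentence.
    static int count(Pattern pattern, String sentence) {
        Matcher matcher = pattern.matcher(sentence);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sentence = "co-operation -- is &&&& key";
        System.out.println(count(LOOSE, sentence));  // prints 4: the bare "--" counts as a word
        System.out.println(count(STRICT, sentence)); // prints 3: "--" is skipped
    }
}
```

In both cases co-operation is counted once and &&&& is not counted at all; the only difference is how a standalone run of hyphens is treated.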
Willem Van Onsem

The part ([-][_])* is wrong. The notation [xyz] means "any single one of the characters inside the brackets" (see http://www.regular-expressions.info/charclass.html). So [-][_] effectively requires exactly the character - followed by exactly the character _, in that order, rather than either one of them.

Fixing your group makes it work:

[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*

and it can be further simplified using \w to

\w+(-\w+)*

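because \w matches 0..9, A..Z, a..z and _ (http://www.regular-expressions.info/shorthand.html) and so you only need to add -. Note that, unlike the longer version, this also accepts leading, trailing, or repeated underscores, since _ is now an ordinary word character.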
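
The difference between the two notations can be checked with a quick sketch using `String.matches`, which tests the whole string against the pattern. The class name `CharClassDemo`, the constant names, and the sample strings are hypothetical and only serve the illustration.

```java
// Hypothetical demo class; the names and sample strings are illustrative only.
final class CharClassDemo {
    // Two one-character classes in a row: a hyphen, then an underscore.
    static final String SEQUENCE = "([-][_])*";
    // One class listing both characters: a hyphen or an underscore.
    static final String EITHER = "[-_]";
    // The corrected word pattern from above.
    static final String WORD = "[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*";

    public static void main(String[] args) {
        System.out.println("-_".matches(SEQUENCE));         // true
        System.out.println("-".matches(SEQUENCE));          // false
        System.out.println("-".matches(EITHER));            // true
        System.out.println("_".matches(EITHER));            // true
        System.out.println("co-operation".matches(WORD));   // true
        System.out.println("co_operation".matches(WORD));   // true
        System.out.println("co--operation".matches(WORD));  // false
    }
}
```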

Jongware

If you can use Java 8:

// requires: import java.util.Arrays;
long wordCount = Arrays.stream(sentence.split(" "))  // split the sentence into space-separated tokens
        .filter(s -> s.matches("[\\w-]+"))           // keep tokens made up entirely of word characters and hyphens
        .count();
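
One thing to be aware of: this variant tests whole space-separated tokens, so a word with punctuation attached (such as a trailing comma) is skipped rather than counted. A minimal sketch of the difference, assuming a hypothetical class `TokenVsMatchDemo` and a made-up sample sentence:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample input are illustrative only.
final class TokenVsMatchDemo {
    // Counts space-separated tokens that consist only of word characters and hyphens.
    static long countTokens(String sentence) {
        return Arrays.stream(sentence.split(" "))
                .filter(s -> s.matches("[\\w-]+"))
                .count();
    }

    // Counts runs of word characters and hyphens anywhere in the sentence.
    static long countMatches(String sentence) {
        Matcher matcher = Pattern.compile("[\\w-]+").matcher(sentence);
        long count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sentence = "co-operation, is key.";
        System.out.println(countTokens(sentence));  // prints 1: only "is" is a clean token
        System.out.println(countMatches(sentence)); // prints 3
    }
}
```

Which behaviour is wanted depends on whether punctuation next to a word should disqualify it.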
griFlo

With Java 9 or later (Matcher.results() was added in Java 9, so this does not compile on Java 8):

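import java.util.regex.Pattern;

public static int getColumnCount(String row) {
    // results() streams one MatchResult per match; count() tallies them.
    return (int) Pattern.compile("[\\w-]+")
        .matcher(row)
        .results()
        .count();
}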
Jakub Krhovják