5

I need a regex that, given the following string:

"test test3 t3st test: word%5 test! testing t[st"

matches only the words made up entirely of a-z characters:

Should match: test testing

Should not match: test3 t3st test: word%5 test! t[st

I have tried ([A-Za-z])\w+, but it also matches part of word%5, which should not be a match.

TylerH
Digao

2 Answers

4

You may use

String patt = "(?<!\\S)\\p{Alpha}+(?!\\S)";

See the regex demo.

It matches one or more letters that are surrounded by whitespace or by the start/end of the string. Alternative patterns are (?<!\S)[a-zA-Z]+(?!\S) (equivalent to the one above) or (?<!\S)\p{L}+(?!\S) (if you also want to match all Unicode letters).

Details:

  • (?<!\\S) - a negative lookbehind that fails the match if there is a non-whitespace char immediately to the left of the current location
  • \\p{Alpha}+ - 1 or more ASCII letters (same as [a-zA-Z]+, but if you use a Pattern.UNICODE_CHARACTER_CLASS modifier flag, \p{Alpha} will be able to match Unicode letters)
  • (?!\\S) - a negative lookahead that fails the match if there is a non-whitespace char immediately to the right of the current location.
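
To make the difference between the three variants concrete, here is a minimal sketch that runs each of them over a string containing a non-ASCII letter. The sample input, the class name LetterWordDemo and the helper findAll are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterWordDemo {
    // Hypothetical helper: collects every match of the pattern, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample input: the middle word ends in a non-ASCII letter (e with acute accent).
        String s = "test caf\u00e9 testing";

        // Default mode: \p{Alpha} only covers ASCII letters.
        Pattern ascii = Pattern.compile("(?<!\\S)\\p{Alpha}+(?!\\S)");
        // With the flag, \p{Alpha} follows the Unicode Alphabetic property.
        Pattern withFlag = Pattern.compile(
                "(?<!\\S)\\p{Alpha}+(?!\\S)", Pattern.UNICODE_CHARACTER_CLASS);
        // \p{L} matches any Unicode letter regardless of flags.
        Pattern anyLetter = Pattern.compile("(?<!\\S)\\p{L}+(?!\\S)");

        System.out.println(findAll(ascii, s));
        System.out.println(findAll(withFlag, s));
        System.out.println(findAll(anyLetter, s));
    }
}
```

With this input, the default pattern should return only test and testing, while the other two also return the accented word.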

See a Java demo:

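import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "test test3 t3st test: word%5 test! testing t[st";
// Letters only, delimited by whitespace or the start/end of the string
Pattern pattern = Pattern.compile("(?<!\\S)\\p{Alpha}+(?!\\S)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // group(0) is the whole match
    System.out.println(matcher.group(0));
}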

Output: test and testing.

Wiktor Stribiżew
  • Thanks Wiktor, and what would be the regex to match the opposite of this? I mean, the rest of the string that is not matched? – Digao Jul 21 '17 at 14:19
  • 1
    @Digao: Sorry, what would be the output then? 2 items: `["test3 t3st test: word%5 test!", "t[st"]` or 6 items: `["test3", "t3st", "test:", "word%5", "test!", "t[st"]`? – Wiktor Stribiżew Jul 21 '17 at 14:22
  • 1
    I suspect you want [this](http://ideone.com/mIvAox) to get the "opposite" results. – Wiktor Stribiżew Jul 21 '17 at 14:28
  • 1
    It looks like there is a way to match those items without lookaheads: you may also use [`"(?:\\S*[^\\s\\p{Alpha}])+\\S*"`](http://ideone.com/ycFtx2). It matches any chunk of non-whitespace chars that contains at least one char that is neither whitespace nor a letter. – Wiktor Stribiżew Jul 21 '17 at 14:39
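
The lookaround-free pattern from the last comment can be tried against the question's string with a short sketch like the one below. The class name OppositeMatchDemo and the helper findAll are illustrative assumptions. This is not the code behind the linked ideone demos.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OppositeMatchDemo {
    // Hypothetical helper: collects every match of the pattern, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "test test3 t3st test: word%5 test! testing t[st";

        // Non-whitespace chunks that contain at least one char
        // that is neither whitespace nor an ASCII letter.
        Pattern nonWords = Pattern.compile("(?:\\S*[^\\s\\p{Alpha}])+\\S*");

        for (String token : findAll(nonWords, s)) {
            System.out.println(token);
        }
    }
}
```

On the question's string this should print the six tokens listed under "Should not match", one per line, and skip test and testing.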
1

Try this

Pattern tokenPattern = Pattern.compile("[\\p{L}]+");

[\\p{L}]+ matches each run of one or more Unicode letters.
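
As a quick check of what this pattern returns on the question's string, here is a minimal sketch. The class name LetterRunDemo and the helper findAll are illustrative assumptions. Because the pattern has no boundary checks, it also picks up the letter runs inside tokens such as test3 or word%5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterRunDemo {
    // Hypothetical helper: collects every match of the pattern, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "test test3 t3st test: word%5 test! testing t[st";
        Pattern tokenPattern = Pattern.compile("[\\p{L}]+");

        // Prints every run of letters, including runs that are
        // only part of a longer whitespace-delimited token.
        System.out.println(findAll(tokenPattern, s));
    }
}
```

For this input the list should contain ten entries (for example, t3st contributes t and st), so extra boundary checks are needed if only whole words are wanted.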

Jobin
Rajani