You need to be clear about a few things first. What is a "word"? You want to find only "words" that start with a letter, so I assume a word can contain other characters too. But which characters are allowed? And what defines the start of such a word: whitespace, any non-letter, any character that is neither a letter nor a digit, ...?
For example:
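import java.util.regex.Matcher;
import java.util.regex.Pattern;

String testInput = "test séntènce îwhere I'm want,to üfind 1words starting $with le11ers.";
// A letter, preceded by start of input or whitespace, followed by word characters
String regex = "(?<=^|\\s)\\pL\\w*";
// UNICODE_CHARACTER_CLASS makes \w and \s Unicode-aware instead of ASCII-only
Pattern p = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = p.matcher(testInput);
while (matcher.find()) {
    System.out.println(matcher.group());
}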
The regex `(?<=^|\s)\pL\w*` finds sequences that start with a letter (`\pL` is the Unicode property for letters), followed by 0 or more "word" characters. Because of the `Pattern.UNICODE_CHARACTER_CLASS` flag, `\w` covers Unicode letters and digits (plus the underscore and combining marks), not just the ASCII ones.
The lookbehind assertion `(?<=^|\s)` ensures that the sequence is preceded by either the start of the string or a whitespace character.
So my code will print:
test
séntènce ==> contains non-ASCII letters
îwhere ==> starts with a non-ASCII letter
I ==> 'm is missing, because `'` is not in `\w`
want
üfind ==> starts with a non-ASCII letter
starting
le11ers ==> contains digits
Missing words:
,to ==> starting with a ","
1words ==> starting with a digit
$with ==> starting with a "$"
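
Which of these results you actually want depends on how you answer the questions at the top. As a rough sketch (the class name `WordStartVariants` and the helper `findAll` are made up for illustration, and both variant patterns are only assumptions about what you might mean by a "word"), here is how two other definitions change the outcome:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordStartVariants {

    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        Pattern p = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
        Matcher m = p.matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "test séntènce îwhere I'm want,to üfind 1words starting $with le11ers.";

        // Variant A (assumption): words are still delimited by whitespace,
        // but an apostrophe is allowed inside a word.
        System.out.println(findAll("(?<=^|\\s)\\pL[\\w']*", input));

        // Variant B (assumption): a word may start after any character
        // that is neither a letter nor a digit, not only after whitespace.
        System.out.println(findAll("(?<![\\p{L}\\p{N}])\\pL\\w*", input));
    }
}
```

With variant A, "I'm" is found as a whole, while ",to", "1words" and "$with" are still skipped. With variant B, "to" and "with" are found as well, but "I'm" is split into "I" and "m", and "1words" is still skipped because the "w" follows a digit.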