First, get rid of any leading or trailing space:
.trim()
Then get rid of HTML entities (&...;):
.replaceAll("&.*?;", "")
& and ; are literal characters in regex, and .*? is the non-greedy version of "any character, any number of times".
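To see why the non-greedy form matters here, a small sketch comparing it with the greedy .* may help. The class name, method names, and sample string are made up for illustration:

```java
class EntityDemo {
    // Non-greedy: each match stops at the first ';' after the '&'.
    static String stripLazy(String s) {
        return s.replaceAll("&.*?;", "");
    }

    // Greedy: a single match runs from the first '&' to the LAST ';'.
    static String stripGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    public static void main(String[] args) {
        String s = "a &amp; b &lt; c";
        System.out.println("[" + stripLazy(s) + "]");   // only the two entities are removed
        System.out.println("[" + stripGreedy(s) + "]"); // " b " between the entities is lost too
    }
}
```

With the greedy version, everything between the first entity and the last one disappears, which is usually not what you want.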
Next get rid of tags and their contents:
.replaceAll("<(.*?)>.*?</\\1>", "")
< and > are taken literally again, .*? is explained above, (...) defines a capturing group, and \\1 (the regex back-reference \1, escaped for a Java string literal) references that group.
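As a rough illustration of the back-reference, here is a minimal sketch. The class name, method name, and inputs are hypothetical, and it only shows what this particular pattern does, not a general tag stripper:

```java
class TagDemo {
    // Removes <name>...</name> where the closing tag repeats group 1 exactly.
    static String stripTags(String s) {
        return s.replaceAll("<(.*?)>.*?</\\1>", "");
    }

    public static void main(String[] args) {
        // Simple tag: group 1 is "b", so "</b>" closes the match.
        System.out.println("[" + stripTags("Hello <b>bold</b> world") + "]");

        // Tag with attributes: group 1 would have to be a href="x",
        // and no closing tag repeats that text, so nothing is removed.
        System.out.println("[" + stripTags("<a href=\"x\">link</a>") + "]");
    }
}
```

Because group 1 captures everything between < and >, the pattern only matches when the closing tag repeats that text exactly, so tags carrying attributes are left in place.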
And finally, split on any sequence of non-letters:
.split("[^a-zA-Z]+")
[a-zA-Z] means all characters from a to z and A to Z, ^ inverts the match, and + means "once or more".
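A quick sketch of the split step on its own, with made-up inputs and a hypothetical class name. One thing worth knowing: if the string starts with non-letters, Java's split keeps an empty first element, while trailing empty elements are dropped:

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on every run of characters that are not ASCII letters.
    static String[] splitWords(String s) {
        return s.split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        // Separators in the middle and at the end: no empty elements.
        System.out.println(Arrays.toString(splitWords("one, two; three!")));

        // Separator at the very start: the first element is the empty string.
        System.out.println(Arrays.toString(splitWords("42 apples")));
    }
}
```

If that leading empty string is a problem, you can filter it out after splitting.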
So everything together would be:
String[] words = str.trim().replaceAll("&.*?;", "").replaceAll("<(.*?)>.*?</\\1>", "").split("[^a-zA-Z]+");
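For reference, a self-contained sketch of the whole chain. The class name WordSplitter, the method name, and the sample input are assumptions for illustration only; split returns an array, so the result is held in a String[]:

```java
import java.util.Arrays;

class WordSplitter {
    // trim -> drop entities -> drop simple tags with their contents -> split on non-letters
    static String[] words(String str) {
        return str.trim()
                  .replaceAll("&.*?;", "")
                  .replaceAll("<(.*?)>.*?</\\1>", "")
                  .split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        String input = "  Tom &amp; Jerry <b>ignored</b> run fast!  ";
        System.out.println(Arrays.toString(words(input)));
    }
}
```

Here the entity and the <b> element (including its text) are removed before splitting, so only the remaining plain words end up in the array.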
Note that this doesn't handle self-closing tags like <img src="a.png" />.
Also note that if you need full HTML parsing, you should let a real HTML parser handle it, as parsing HTML with regex is a bad idea.