I have the following sentence:

String str = " And God said, <sup>c</sup>&#8220;Let there be light,&#8221; and there was light.";

How do I retrieve all of the words in the sentence, expecting the following?

And
God
said
Let 
there
be
light
and 
there
was
light
blacktide
  • Do you need to get rid of the content between the `sub` tags? Or just get rid of all tags and display the words? – joel314 Apr 16 '16 at 14:43

2 Answers

First, get rid of any leading or trailing space:

.trim()

Then get rid of HTML entities (&...;):

.replaceAll("&.*?;", "")

& and ; are literal characters in a regex, and .*? is the non-greedy (lazy) version of "any character, any number of times", so each match stops at the first ; it finds.
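
To see the lazy match in action, here is a minimal sketch. The class name EntityStripDemo and both method names are made up for illustration. It also shows one caveat: &.*?; starts at any ampersand, so a bare & followed later by a ; takes the text in between with it. The stricter pattern in stripStrict is an assumption about what a tighter match could look like, not part of the answer.

```java
public class EntityStripDemo {
    // Pattern from the answer: '&', then as few characters as possible, then ';'.
    static String stripLoose(String s) {
        return s.replaceAll("&.*?;", "");
    }

    // Assumed stricter variant: '&', an optional '#', word characters, then ';'.
    static String stripStrict(String s) {
        return s.replaceAll("&#?\\w+;", "");
    }

    public static void main(String[] args) {
        String quote = "&#8220;Let there be light,&#8221; and";
        System.out.println(stripLoose(quote));   // Let there be light, and

        String tricky = "salt & pepper; done";
        System.out.println(stripLoose(tricky));  // "& pepper;" is removed too
        System.out.println(stripStrict(tricky)); // left unchanged
    }
}
```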

Next get rid of tags and their contents:

.replaceAll("<(.*?)>.*?</\\1>", "")

< and > are taken literally as well, .*? is explained above, (...) defines a capturing group, and \\1 is a backreference to that group (the regex itself is \1; the second backslash is needed inside a Java string literal).
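
A small sketch of the backreference at work, using a hypothetical class TagStripDemo. Because group 1 captures everything between < and the first >, an opening tag with attributes has no matching closing tag under this pattern and is left alone.

```java
public class TagStripDemo {
    // Removes <name>...</name> pairs; \1 must repeat exactly what group 1 captured.
    static String stripPairs(String s) {
        return s.replaceAll("<(.*?)>.*?</\\1>", "");
    }

    public static void main(String[] args) {
        // Group 1 is "sup", so "</sup>" closes the match.
        System.out.println(stripPairs("said, <sup>c</sup>Let"));

        // Group 1 would be: a href="x"  but the closing tag is just </a>, so nothing matches.
        System.out.println(stripPairs("<a href=\"x\">link</a> text"));
    }
}
```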

And finally, split on any sequence of non-letters:

.split("[^a-zA-Z]+")

[a-zA-Z] means all characters from a to z and A to Z, a ^ right after the opening bracket negates the class, and + means "once or more".
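
The following sketch (the class name SplitDemo is hypothetical) shows the split on its own, and why the leading space matters: when the input starts with a separator, String.split returns an empty first element, whereas trailing empty strings are dropped.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits on every run of characters that are not ASCII letters.
    static String[] words(String s) {
        return s.split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        // Trailing "." produces no extra element.
        System.out.println(Arrays.toString(words("And God said, Let there be light.")));

        // Leading space produces an empty first element.
        System.out.println(Arrays.toString(words(" And God")));
    }
}
```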

So everything together would be:

String[] words = str.trim().replaceAll("&.*?;", "").replaceAll("<(.*?)>.*?</\\1>", "").split("[^a-zA-Z]+");

Note that this doesn't handle self-closing tags like <img src="a.png" />, nor opening tags with attributes, because \\1 has to match the whole content of the opening tag.
Also note that if you need full HTML parsing, you should let a real HTML parser do the work, as parsing HTML with regex is a bad idea.

Siguza

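You can use String.replaceAll(regex, replacement) with the regex [^A-Za-z]+ to replace every run of non-letters with a single space, which leaves only the words. On its own, that would also keep the letters of the sup tag and the c inside it, which is why the first statement removes the tag and everything between it. The result is one space-separated string rather than an array.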

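    // Remove the <sup>...</sup> marker first; [^<]* allows content of any length.
    String str = " And God said, <sup>c</sup>&#8220;Let there be light,&#8221; and there was light.".replaceAll("<sup>[^<]*</sup>", "");
    // Then collapse every run of non-letters into a single space.
    String newstr = str.replaceAll("[^A-Za-z]+", " ");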
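
To go from that space-separated string to the individual words the question asks for, one possible follow-up is to trim and split on the space. This is a sketch under that assumption; the class SpaceJoinDemo and its words method are made-up names, not part of the answer.

```java
import java.util.Arrays;

public class SpaceJoinDemo {
    static String[] words(String html) {
        // Drop <sup>...</sup> markers, whatever their length.
        String noSup = html.replaceAll("<sup>[^<]*</sup>", "");
        // Collapse every run of non-letters into one space.
        String spaced = noSup.replaceAll("[^A-Za-z]+", " ");
        // Trim the leading and trailing space, then split on the single spaces.
        return spaced.trim().split(" ");
    }

    public static void main(String[] args) {
        String str = " And God said, <sup>c</sup>&#8220;Let there be light,&#8221; and there was light.";
        System.out.println(Arrays.toString(words(str)));
    }
}
```

Numeric entities such as &#8220; vanish here because they contain no letters; a named entity such as &amp; would leave its letters behind as a stray word.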
tbhall