You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want. The "wildcard" you want in regular expressions is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character (by default, any character except a line terminator), and the * means "the preceding item, 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want: it outputs This Extended, because .* is greedy and matches everything up to the last semicolon in the string. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
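To make the difference concrete, here is a minimal sketch that runs the greedy, reluctant (.*?) and negated-class patterns over the two-token input from above. The class and method names are made up for illustration only.

```java
public class EntityRegexDemo {
    // Greedy: .* runs to the last ';' in the input.
    static String stripGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Reluctant: .*? stops at the first ';' after each '&'.
    static String stripReluctant(String s) {
        return s.replaceAll("&.*?;", "");
    }

    // Negated class: [^;]* can never run past a ';'.
    static String stripNegated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    public static void main(String[] args) {
        String input = "This &escape;String &anotherescape;Extended";
        System.out.println(stripGreedy(input));    // This Extended
        System.out.println(stripReluctant(input)); // This String Extended
        System.out.println(stripNegated(input));   // This String Extended
    }
}
```

The reluctant and negated-class patterns give the same result here. One difference to keep in mind: [^;] also matches line terminators, while . does not unless you enable DOTALL.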
The [^;] version has performance benefits over the reluctant alternative &.*?; on strings that do not match, so I highly recommend it, especially since not all HTML files will contain a &abc; token, and the &.*?; version can become a serious performance bottleneck as a result.
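If you run the replacement over many strings, it may also help to compile the pattern once, since String.replaceAll compiles its regex on every call. A rough sketch, assuming a hypothetical helper class called EntityStripper:

```java
import java.util.regex.Pattern;

public class EntityStripper {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern ENTITY = Pattern.compile("&[^;]*;");

    // Removes every &...; token from the given text.
    public static String stripEntities(String text) {
        return ENTITY.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripEntities("This &escape;String &anotherescape;Extended"));
        // Text without any '&' comes back unchanged.
        System.out.println(stripEntities("No tokens here"));
    }
}
```

This does not change what gets matched; it only avoids recompiling the same expression for each input.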
" and "
" to show up in this question, right? You'll want to edit your question and mark those strings as code and then they will show up if that's the case. – Zack Macomber Sep 11 '12 at 20:00