Is it possible to use replaceAll() with wildcards?

Question

Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.

What I'm looking to do is parse a string (which contains valid HTML, to a point). After I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.

To do the second part, I was hoping to use something along the lines of s.replaceAll("&*;","")

That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;

Correct me if I'm wrong, but I think you want "<p>" and "</p>" to show up in this question, right? You'll want to edit your question and mark those strings as code and then they will show up if that's the case. — Zack Macomber, Sep 11 '12 at 20:00

Brian · Accepted Answer · 2018-08-30T14:02:19.293

You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.

For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.

The "wildcard" in regular expressions that you want is the .* expression. Using your example:

String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);

This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:

"This &escape;String &anotherescape;Extended"

will probably not do what you want: it outputs This Extended, because .* is greedy and runs all the way to the last semicolon in the string. To fix this, specify exactly what you want to match instead of using the . character. This is done with [^;], which means "any character that's not a semicolon":

String removed = ampStr.replaceAll("&[^;]*;", "");
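To make the difference concrete, here is a small side-by-side sketch of the two patterns on the two-token string from above. The class and method names are made up for illustration; only String.replaceAll comes from the JDK.

```java
class EntityRemovalDemo {
    // Greedy: .* extends to the LAST semicolon in the string.
    static String removeGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Negated class: [^;]* stops at the FIRST semicolon after each ampersand.
    static String removeEntities(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    public static void main(String[] args) {
        String input = "This &escape;String &anotherescape;Extended";
        System.out.println(removeGreedy(input));   // This Extended
        System.out.println(removeEntities(input)); // This String Extended
    }
}
```

The greedy version swallows the text between the two tokens as well, while the negated-class version removes each token on its own.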

This has performance benefits over the reluctant form &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token, and the &.*?; version can become a serious performance bottleneck as a result.

Quick note: replaceAll takes two arguments: replaceAll(regex, replacement). So your examples should be: `String removed = ampStr.replaceAll("&.*;", "");` and `String removed = ampStr.replaceAll("&[^;]*;", "");` — Wingie, Aug 30 '18 at 08:27

score 1 · Answer 2 · answered Sep 11 '12 at 20:03

The expression you want is:

s.replaceAll("&.*?;","");
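For illustration, a quick sketch of how the reluctant quantifier .*? behaves, next to the [^;]* form for comparison. The class and method names are hypothetical. One difference worth knowing: by default . does not match line terminators in Java, so the two patterns can disagree when a token spans a line break.

```java
class ReluctantDemo {
    // Reluctant: .*? expands only as far as needed to reach the next semicolon.
    static String removeReluctant(String s) {
        return s.replaceAll("&.*?;", "");
    }

    // Negated class, shown for comparison; [^;] also matches newlines.
    static String removeNegated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    public static void main(String[] args) {
        String input = "This &escape;String &anotherescape;Extended";
        System.out.println(removeReluctant(input)); // This String Extended

        // A token broken across lines: '.' stops at '\n', so only the
        // negated-class pattern removes it.
        String multiLine = "a &x\ny; b";
        System.out.println(removeReluctant(multiLine).equals(multiLine)); // true
        System.out.println(removeNegated(multiLine).equals("a  b"));      // true
    }
}
```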

But do you really want to be parsing HTML this way? You may be better off using an XML parser.
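If a full parser is more than you need, the scoping the question asks for (only between the second <p> and the next </p>) can be approximated with plain string searches. This is a rough sketch, not a robust HTML solution: it assumes the tags appear literally as <p> and </p> (no attributes, no different casing), and the class and method names are invented for the example.

```java
class ScopedEntityRemoval {
    // Removes &...; tokens only between the second "<p>" and the next "</p>".
    // Assumption: tags are written exactly as "<p>" and "</p>".
    static String stripInSecondParagraph(String html) {
        int first = html.indexOf("<p>");
        if (first < 0) return html;                 // no paragraph at all
        int second = html.indexOf("<p>", first + 3);
        if (second < 0) return html;                // no second paragraph
        int start = second + 3;                     // just past the second "<p>"
        int end = html.indexOf("</p>", start);
        if (end < 0) return html;                   // unclosed paragraph: leave as-is

        String cleaned = html.substring(start, end).replaceAll("&[^;]*;", "");
        return html.substring(0, start) + cleaned + html.substring(end);
    }

    public static void main(String[] args) {
        String html = "<p>a&amp;b</p><p>c&nbsp;d&lt;e</p><p>f&gt;g</p>";
        // Only the second paragraph is cleaned:
        System.out.println(stripInSecondParagraph(html));
        // <p>a&amp;b</p><p>cde</p><p>f&gt;g</p>
    }
}
```

Anything outside the second paragraph is copied through untouched. For real-world HTML, a proper parser is still the safer route.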

answered Sep 11 '12 at 20:03

Jon Lin

I think the OP stated they want this to occur after the SECOND instance of "<p>" up to "</p>"...this code removes any portion in the String between (and including) &; – Zack Macomber Sep 11 '12 at 20:10
