
Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.

What I'm looking to do is parse a string (which contains valid HTML, to a point). After I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.

To do the second part, I was hoping to use something along the lines of s.replaceAll("&*;","")

That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;

Brian
Deslyxia
  • Correct me if I'm wrong, but I think you want "<p>" and "</p>" to show up in this question, right? You'll want to edit your question and mark those strings as code, and then they will show up if that's the case. – Zack Macomber Sep 11 '12 at 20:00

2 Answers


You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.

For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.

The "wildcard" in regular expressions that you want is the .* expression. Using your example:

String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);

This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:

"This &escape;String &anotherescape;Extended"

will probably not do what you want: it outputs This Extended, because .* is greedy and matches everything from the first & through the last ; in the string. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":

String removed = ampStr.replaceAll("&[^;]*;", "");

For strings that contain no match, this also has performance benefits over the reluctant alternative &.*?;, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can become a serious performance bottleneck as a result.
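To see the three patterns side by side, here is a minimal sketch that runs each of them over the two-entity example from above. The class and method names (EntityRegexDemo, greedy, reluctant, negated) are made up for illustration; only String.replaceAll comes from the JDK.

```java
public class EntityRegexDemo {
    // Greedy: ".*" grabs as much as it can, so the match runs from the
    // first '&' to the LAST ';' on the line.
    static String greedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Reluctant: ".*?" stops at the first ';' it can reach.
    static String reluctant(String s) {
        return s.replaceAll("&.*?;", "");
    }

    // Negated class: "[^;]*" can never run past a ';' in the first place.
    static String negated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    public static void main(String[] args) {
        String input = "This &escape;String &anotherescape;Extended";
        System.out.println(greedy(input));    // This Extended
        System.out.println(reluctant(input)); // This String Extended
        System.out.println(negated(input));   // This String Extended
    }
}
```

Note that all three patterns will also swallow text after a bare & that is not part of an entity, as long as a ; appears later, so this is only safe on input where every & starts an entity.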

Brian
  • Quick note: replaceAll takes two arguments: replaceAll(regex, replacement). So your examples should be: `String removed = ampStr.replaceAll("&.*;", "");` and `String removed = ampStr.replaceAll("&[^;]*;", "");` – Wingie Aug 30 '18 at 08:27
  • Thanks @Wingie, fixed! – Brian Aug 30 '18 at 14:02

The expression you want is:

s.replaceAll("&.*?;","");

But do you really want to be parsing HTML this way? You may be better off using an XML parser.
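If a full parser is more than you need, limiting the replacement to the second paragraph can be roughly sketched with plain string operations. This is only an illustration under some strong assumptions: the tags appear literally as <p> and </p> (lowercase, no attributes), and every & inside that paragraph starts an entity. The class and method names are hypothetical, and the pattern used is the negated-class form &[^;]*; rather than &.*?;.

```java
public class SecondParagraphCleaner {
    // Removes "&...;" sequences only between the second "<p>" and the
    // next "</p>". Returns the input unchanged if that span is not found.
    static String stripEntitiesInSecondParagraph(String html) {
        int first = html.indexOf("<p>");
        if (first < 0) {
            return html;
        }
        int second = html.indexOf("<p>", first + "<p>".length());
        if (second < 0) {
            return html;
        }
        int bodyStart = second + "<p>".length();
        int close = html.indexOf("</p>", bodyStart);
        if (close < 0) {
            return html; // assumption: leave unclosed paragraphs alone
        }
        String cleaned = html.substring(bodyStart, close)
                             .replaceAll("&[^;]*;", "");
        // Stitch the untouched prefix, cleaned body, and untouched suffix.
        return html.substring(0, bodyStart) + cleaned + html.substring(close);
    }

    public static void main(String[] args) {
        String html = "<p>a &amp; b</p><p>c &nbsp;d &lt;e</p><p>f &gt; g</p>";
        System.out.println(stripEntitiesInSecondParagraph(html));
        // <p>a &amp; b</p><p>c d e</p><p>f &gt; g</p>
    }
}
```

Anything beyond this (attributes on the tag, uppercase tags, nested markup) is where a real HTML parser starts to pay off.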

Jon Lin
  • I think the OP stated they want this to occur after the SECOND instance of "<p>" up to "</p>"...this code removes any portion in the String between (and including) &; – Zack Macomber Sep 11 '12 at 20:10