Regex to strip out greater than > and less than < characters from HTML string ignoring existing tags

Question

I don't have much experience with regular expressions, and I have an issue where I need to replace all instances of > and < with &gt; and &lt; but leave the HTML tags intact.

For example:

String string =" <p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>O is > 1 and < 100 <p>";
// needs to be converted to:
"<p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>O is  &gt; 1 and  &lt; 100 <p>";

I have tried some lookahead and lookbehind expressions, but I cannot seem to get any of them to work. For example:

String string =" <p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>) is > 1 and < 100 <p>";

String reg1="<(?=[^>\\/]*<\\/)";


Pattern p1 = Pattern.compile(reg1);

String test = p1.matcher(string).replaceAll("&lt;");

This does not seem to have any effect.

I wondered if anyone else had come across this before or if anyone can give me any guidance?

[Don't even try](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an HTML parser, which will figure out where those `>` and `<` characters are (to the extent possible), and then have it serialize the result. Note that it's perfectly valid, for instance, to write `
foo
`. — T.J. Crowder, May 29 '15 at 17:12
I have to agree. You won't be able to handle that with regex. — jHilscher, May 29 '15 at 17:13
Yes, use a parser. [You have a lot of choices.](http://en.wikipedia.org/wiki/Comparison_of_HTML_parsers) (Actually, now that I think about it, this is malformed HTML. The < and > symbols should already be escaped. You are basically stuck; you may have to resort to a general XML parser and then sort out tags that aren't really tags.) — markspace, May 29 '15 at 17:19

score 2 · Answer 1 · answered May 29 '15 at 17:27

If all < and > are only present in their escaped version (&lt; and &gt;) you would be able to match and remove them using regex.

But if they aren't (which seems to be your case), ultimately, you can't match with 100% accuracy only using regex due to the nested nature of the HTML/XML tags.

Your best bet is an HTML Parser, like jsoup:

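import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExtractGtLt {
    public static void main(String[] args) {
        String html = "<p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>) is > 1 and < 100 <p>";
        // Parse the fragment into a document; the stray > and < end up in text nodes.
        Document doc = Jsoup.parseBodyFragment(html);
        // unwrap() removes <body> and returns its first child (the first <p>);
        // toString() serializes that element with its text escaped.
        String parsedHTML = doc.body().unwrap().toString();
        System.out.println(parsedHTML);
    }
}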

Output:

 <p class="anotherClass"> Here is some text the value is for H<sub>2</sub>) is &gt; 1 and &lt; 100 </p>

Thanks, this is going to be very useful! — Megan Eisenbraun, May 30 '15 at 07:10

score 2 · Answer 2 · answered May 30 '15 at 03:22

Using regex alone to "parse" HTML markup comes with some hefty caveats, which many, many folks here on SO have commented on. However, your request is relatively modest.

Naked < symbols between tags can be found with <(?=[^>]*(?:<|$)) and replaced by &lt;.

Naked > symbols between tags can be found with ((?:^|>)[^<]*?)> and replaced by \1&gt;.

Note that both must be applied to the whole string (not line by line). That is, . must match \n, ^ must match the beginning of the string (not of a line), and $ must match the end of the string (not of a line).

Note also that each must be performed multiple times until no results are left, since only one replacement can be made at a time between tags.
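As a rough sketch of how the two patterns might be applied in Java (the language used in the question), the loop below keeps replacing until the string stops changing. The class and method names are made up for illustration, and Java writes the backreference in the replacement as $1 rather than \1. The patterns are compiled without the MULTILINE flag, so ^ and $ anchor to the whole input.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class NakedAngleBrackets {
    // A '<' that reaches another '<' or the end of input before any '>'.
    private static final Pattern NAKED_LT = Pattern.compile("<(?=[^>]*(?:<|$))");
    // A '>' preceded, with no '<' in between, by another '>' or the start of input.
    private static final Pattern NAKED_GT = Pattern.compile("((?:^|>)[^<]*?)>");

    static String escapeNaked(String html) {
        String previous;
        String current = html;
        // Repeat both replacements until a full pass changes nothing.
        do {
            previous = current;
            current = NAKED_LT.matcher(current).replaceAll("&lt;");
            current = NAKED_GT.matcher(current).replaceAll("$1&gt;");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String html = "<p class=\"anotherClass\"> Here is some text the value is for "
                + "H<sub>2</sub>O is > 1 and < 100 <p>";
        System.out.println(escapeNaked(html));
        // prints: <p class="anotherClass"> Here is some text the value is for H<sub>2</sub>O is &gt; 1 and &lt; 100 <p>

        // Two stray '>' between the same pair of tags need two passes.
        System.out.println(escapeNaked("<b>a > b > c</b>"));
        // prints: <b>a &gt; b &gt; c</b>
    }
}
```

The loop is guaranteed to stop because every replacement removes one literal < or > and the inserted entities contain neither character.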

Caveats:

This only finds and replaces stray < or > symbols between tags, NOT in the tags themselves. That means that it will mess up on something like <a href="/link/with/</symbol/in/it">.
You should, if practical, have a human check the resulting changes for validity, or at least run it through an automated checker.
These regexes are time-expensive, so may not be practical if speed is an issue.
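
To make the first caveat concrete, here is a small Java sketch that runs only the < pattern over a link with a < inside its attribute value. The link text and closing tag are added here for illustration, and the class name is hypothetical. The tag's own opening bracket gets escaped, which breaks the markup.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class AttributeCaveatDemo {
    // Same "naked <" pattern as above: a '<' that reaches another '<' or the end before any '>'.
    private static final Pattern NAKED_LT = Pattern.compile("<(?=[^>]*(?:<|$))");

    static String escapeNakedLt(String html) {
        return NAKED_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/link/with/</symbol/in/it\">x</a>";
        System.out.println(escapeNakedLt(html));
        // prints: &lt;a href="/link/with/</symbol/in/it">x</a>
    }
}
```

The pattern cannot tell that the second < sits inside a quoted attribute, so it treats the real tag opener as the stray character.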

To reiterate points made by others, please consider a markup parser instead, if doing any work with untrusted inputs.

This is great. I will be using a parser in some instances, but this could be very useful for helping to validate inside the database. Thanks for your time. — Megan Eisenbraun, May 30 '15 at 07:11
