How do I Java regex a DIV from within a page

Question

My problem is I want to delete a <div> xxx </div> from within an arbitrary page of HTML.

So given a page ...

<div> foo <div> bar <div> xxx </div> foo </div> bar </div>

I want to end up with

<div> foo <div> bar  foo </div> bar </div>

I thought that replaceFirst("<div.*?xxx.*?</div>", "") would do it. I assumed the magic ? would make the match lazy and leave the initial divs. However it insisted on being greedy and matching from the first div.
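For anyone who wants to see the behaviour described above, here is a minimal sketch (the class name LazyMatchDemo and the helper firstMatch are made up for illustration). A regex match always starts at the leftmost position where a match is possible; the lazy ? only limits how far each quantifier stretches to the right from that starting point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyMatchDemo {
    static final String PAGE =
        "<div> foo <div> bar <div> xxx </div> foo </div> bar </div>";

    /** Returns the text the original pattern actually matches, or null if none. */
    static String firstMatch(String input) {
        Matcher m = Pattern.compile("<div.*?xxx.*?</div>").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The match begins at the FIRST "<div", because that start position
        // can be extended (lazily) until it reaches "xxx".
        System.out.println(firstMatch(PAGE));
        // prints: <div> foo <div> bar <div> xxx </div>

        // So replaceFirst removes all three opening divs, not just the inner one.
        System.out.println(PAGE.replaceFirst("<div.*?xxx.*?</div>", ""));
        // prints:  foo </div> bar </div>
    }
}
```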

Since it took me an hour to find a solution, I'm posting my answer below to save those that follow.

Don't use regex on HTML. Especially not malformed HTML like you have here. – Roddy of the Frozen Peas, Oct 09 '12 at 19:42
@RoddyoftheFrozenPeas is correct. See [this post](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) for a summary of why it isn't a good idea. – Duncan Jones, Oct 09 '12 at 19:48
Fair point, but my environment prevents me from using third-party libraries, so I needed to solve this with the standard libs. – pinoyyid, Oct 09 '12 at 20:02

score 1 · Answer 1 · edited May 23 '17 at 12:27

I think this might be a more correct way to accomplish this with a regex, assuming you want the last <div>:

"<div>((?!<div>).)*?xxx((?!<div>).)*?</div>"

Although I'm inclined to say that if you're using negative look-arounds like this, you might be better served finding a tool better suited to the task. This is academic, really. Interesting maybe. But this, and any of the provided solutions, won't do well if you up the complexity just a little bit from the, I'm guessing trivial, example provided.

For more on them though, there's a fantastic answer about them here: Regular expression to match a line that doesn't contain a word?
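A small sketch of the pattern from this answer applied with replaceFirst (the class name TemperedLookaheadDemo and the helper removeDiv are hypothetical). The ((?!<div>).)*? part consumes one character at a time, but only while the next characters are not another literal <div>, so a match cannot start at an outer div and run across an inner one. Note that, as written, the pattern only recognises a bare <div> with no attributes.

```java
final class TemperedLookaheadDemo {
    // Each "." is guarded by a negative lookahead, so the match can never
    // step over the start of another "<div>".
    static final String INNERMOST_DIV_WITH_XXX =
        "<div>((?!<div>).)*?xxx((?!<div>).)*?</div>";

    /** Removes the first div that directly contains "xxx"; input is otherwise unchanged. */
    static String removeDiv(String html) {
        return html.replaceFirst(INNERMOST_DIV_WITH_XXX, "");
    }

    public static void main(String[] args) {
        String page = "<div> foo <div> bar <div> xxx </div> foo </div> bar </div>";
        System.out.println(removeDiv(page));
        // prints: <div> foo <div> bar  foo </div> bar </div>

        // Letters such as d, i, v inside the target div are not a problem here.
        System.out.println(removeDiv("<div> a <div>d xxx</div> b </div>"));
        // prints: <div> a  b </div>
    }
}
```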

score 0 · Answer 2 · pinoyyid · 2012-10-09T21:23:40.357

The answer I came up with is

.replaceFirst("<div[^(div)]*?xxx.*?</div>", ""); // WARNING - THIS IS BROKEN !!!

If there's a better solution, I'll be happy to endorse it. I still don't understand why my original version doesn't work, but all's well that ends well.

EDIT: as many pointed out, the above solution does NOT work when the inner div contains a d, i, or v.

I ended up with

.replaceFirst("(?s)(<div.*)<div.*xxx.*?</div>","$1");
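Here is a sketch of how that final pattern behaves, for those who want to try it (the class name GreedyPrefixDemo and the helper removeDiv are made up for illustration). The greedy group (<div.*) grabs as much as it can and only backs off until the remainder can match, so the <div that follows it is the last one before xxx. As far as I can tell this also means the pattern needs at least one enclosing div in front of the target, and with several xxx divs on a page it is worth testing which one gets removed.

```java
final class GreedyPrefixDemo {
    // (?s) lets "." match line breaks; group 1 is kept via "$1" in the replacement.
    static final String REGEX = "(?s)(<div.*)<div.*xxx.*?</div>";

    /** Removes the div around "xxx" while keeping everything captured before it. */
    static String removeDiv(String html) {
        return html.replaceFirst(REGEX, "$1");
    }

    public static void main(String[] args) {
        String page = "<div> foo <div> bar <div> xxx </div> foo </div> bar </div>";
        System.out.println(removeDiv(page));
        // prints: <div> foo <div> bar  foo </div> bar </div>

        // No enclosing div in front of the target: the pattern does not match,
        // so the input comes back unchanged.
        System.out.println(removeDiv("<div> xxx </div>"));
        // prints: <div> xxx </div>
    }
}
```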

The consensus is that regex and HTML are like cabbage and custard. While I'm sure that's good advice, my specific scenario is (a) I have control over all of the HTML, and (b) I can't bring in external libraries. Given those specific considerations, I'm comfortable that regex works for me.

I hope those that follow will find this thread useful, and thanks for all of the contributions.

edited Oct 09 '12 at 21:23

answered Oct 09 '12 at 19:39


This is still not proper I think! It would not work for this scenario: – ppeterka Oct 09 '12 at 19:54

(ehh, 5min edit criteria, and not hitting Shift with Enter...) @pinoyyid I think this is still not correct! It would not work for this scenario: `d xxx`. But for HTML, as others mentioned, regexp is not the tool you'd want to use, because even with the solution posted by @JeffBowman if the div contains a simple ``, or even an even simpler ``, the regex won't match it - not to mention cases when HTML is badly formed, which tends to occur far more frequently out in the wild than it should... – ppeterka Oct 09 '12 at 20:03

score 0 · Answer 3 · answered Oct 09 '12 at 19:52

A lazy match doesn't do quite what you expect it to do; it will try to make the matched substring as short as possible, but it will still start matching from the first instance it sees. You also will not have success with [^(div)], which according to the Pattern docs is a character class: it matches any single character other than d, i, v, (, or ), rather than anything other than the string "div".
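To make the character-class point concrete, here is a minimal sketch (the class name CharClassDemo and the helpers accepted and removeDiv are hypothetical). Inside square brackets the parentheses are just literal characters, so [^(div)] means the same as [^div()].

```java
final class CharClassDemo {
    static final String BROKEN = "<div[^(div)]*?xxx.*?</div>";

    /** True if the single character c is accepted by the class [^(div)]. */
    static boolean accepted(char c) {
        return String.valueOf(c).matches("[^(div)]");
    }

    /** Applies the broken pattern from the thread; returns the input unchanged on no match. */
    static String removeDiv(String html) {
        return html.replaceFirst(BROKEN, "");
    }

    public static void main(String[] args) {
        // d, i, v, ( and ) are rejected one character at a time; everything else passes.
        for (char c : "div()x< ".toCharArray()) {
            System.out.println("'" + c + "' accepted: " + accepted(c));
        }

        // A 'd' between "<div" and "xxx" stops the class from reaching "xxx",
        // so nothing matches and nothing is removed.
        System.out.println(removeDiv("<div> a <div>d xxx</div> b </div>"));
        // prints: <div> a <div>d xxx</div> b </div>
    }
}
```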

I echo the recommendation against using regex on HTML; it is literally not expressive enough to parse HTML well. Instead, use an HTML parser and an XPath query.
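If the markup is well-formed enough to parse as XML (an assumption; real-world HTML often is not), the parser-plus-XPath route can be done with the JDK alone. What follows is only a rough sketch: the class name XPathDivRemover and the helper removeDiv are made up for illustration, and the JDK's XML parser is standing in for a real HTML parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XPathDivRemover {
    /**
     * Removes the first div whose own text (not a descendant's) contains marker.
     * Assumes xhtml is well-formed XML and marker contains no single quote.
     */
    static String removeDiv(String xhtml, String marker) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // text() selects only direct text children, so enclosing divs
            // are not selected just because a nested div contains the marker.
            String query = "(//div[text()[contains(., '" + marker + "')]])[1]";
            Node target = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODE);
            if (target != null) {
                target.getParentNode().removeChild(target);
            }

            // Serialize the modified tree back to a string.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or serialize the markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<div> foo <div> bar <div> xxx </div> foo </div> bar </div>";
        System.out.println(removeDiv(page, "xxx"));
    }
}
```

Unlike the regex versions, this does not care about attributes on the div or about what other tags sit inside it, but it will throw on markup that is not well-formed.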

If you are sure your DIV has no children, your closest approximation is to do something like this:

.replaceFirst("<div[^<]+?xxx.*?</div>", "")

...wherein the [^<]+? will prevent the first half from finding any DIV with child tags.
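A quick sketch of that approximation in use (the class name NoChildTagsDemo and the helper removeDiv are made up for illustration). Because [^<]+? refuses to cross any <, the match can only begin at a div that has no tag between its opening and the xxx.

```java
final class NoChildTagsDemo {
    // [^<]+? may not consume a '<', so the match cannot start at an outer div
    // and run through the opening tag of an inner one.
    static final String REGEX = "<div[^<]+?xxx.*?</div>";

    /** Removes the first div that reaches "xxx" without any tag in between. */
    static String removeDiv(String html) {
        return html.replaceFirst(REGEX, "");
    }

    public static void main(String[] args) {
        String page = "<div> foo <div> bar <div> xxx </div> foo </div> bar </div>";
        System.out.println(removeDiv(page));
        // prints: <div> foo <div> bar  foo </div> bar </div>

        // A child tag before "xxx" blocks the match, so nothing is removed.
        System.out.println(removeDiv("<div> <b>bold</b> xxx </div>"));
        // prints: <div> <b>bold</b> xxx </div>
    }
}
```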
