Parse html with jsoup and remove the tag block

Question

I want to remove a tag together with everything inside it. An example input may be

Input:

<body>
  start
  <div>
    delete from below
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

The output will be:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

Basically, I have to remove the entire block of the first occurrence of <div class="XYZ">.

Thanks,

I found the answer in the Jsoup selector. The solution will be something like: `Document doc = Jsoup.parse(html); doc.select("div.XYZ").first().remove(); return doc.body().outerHtml();` But there is one problem: when I ran this, it gave me the correct answer for an HTML string that has `<div class="XYZ">`, but it throws `java.lang.NullPointerException` if `<div class="XYZ">` is not present in the input HTML string. Do I need to check first and do the step only if I find a div of that type? Thanks. – user2200660, Apr 03 '13 at 19:07
No, you can't. But you can post your solution as an *answer* and then accept it (see: http://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/). If your question is solved, please do it that way. – ollo, Apr 03 '13 at 19:10

ollo · Accepted Answer · 2013-04-03T21:29:39.757

16

You'd better iterate over all the elements found, so you can be sure that

a.) all elements are removed and
b.) nothing is done if there's no element.

Example:

Document doc = ...

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}

Edit:

(An addition to my comment)

Don't use exception handling when a simple null / range check is enough. Instead of:

doc.select("div.XYZ").first().remove();

use:

Elements divs = doc.select("div.XYZ");

if( !divs.isEmpty() )
{
    /*
     * Here it's safe to call 'first()' since there is at least one element.
     */
    divs.first().remove();
}
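
Putting the check together with the question's original goal (remove only the first matching block, and do nothing when there is no match), a minimal self-contained sketch could look like the following. The class name `RemoveFirstBlock` and the method `removeFirst` are made up for illustration; only `Jsoup.parse`, `select`, `first`, `remove`, `body` and `outerHtml` are jsoup calls. It relies on `first()` returning `null` for an empty selection, which is equivalent to the `isEmpty()` check.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RemoveFirstBlock {

    // Hypothetical helper: removes the first element matching cssQuery, if any.
    static String removeFirst(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);

        // first() returns null when the selection is empty, so check before use.
        Element first = doc.select(cssQuery).first();
        if (first != null) {
            // Removing the element also removes everything nested inside it.
            first.remove();
        }
        return doc.body().outerHtml();
    }

    public static void main(String[] args) {
        String html = "<div>keep<div class=\"XYZ\">drop"
                + "<div class=\"XYZ\">nested</div></div>keep too</div>";
        System.out.println(removeFirst(html, "div.XYZ"));

        // No matching element: the input is returned unchanged in content,
        // and no NullPointerException is thrown.
        System.out.println(removeFirst("<p>no match here</p>", "div.XYZ"));
    }
}
```

Because the nested `div.XYZ` lives inside the first one, it disappears together with its parent, which matches the expected output in the question.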

edited Apr 03 '13 at 21:29

answered Apr 03 '13 at 19:18

ollo

Thanks ollo. I did it with `doc.select("div.XYZ").first().remove();` and kept this in a try block; if an exception is caught (which means the required element is absent), it returns the original string. That way it was solved. But yours is a better way of doing this. – user2200660 Apr 03 '13 at 21:15
*Don't* use exception handling for this; a simple *null check* is much better. – ollo Apr 03 '13 at 21:24
Thanks once again @ollo. Your solution also solved another problem for me, where I have to check an HTML tag for a particular own text and remove it. I did not know that you can remove an element while iterating over the selection in a for-each loop, and this is coming in handy. Thanks. – user2200660 Apr 03 '13 at 21:24
Please see my edit. By the way, you're right: using a loop is better, since the loop is skipped if there are no elements (`first()` returns `null` in that case, so the chained `remove()` throws a `NullPointerException`). – ollo Apr 03 '13 at 21:29
It's possible to use a selector that selects only those elements which have own text (or have none). – ollo Apr 03 '13 at 21:32
Can you match on two attributes of an element? For example, I have to match ... For this I am doing `doc.select("blockquote[cite~="+regex+"]");` where `regex = "\".*?\""`, but this is not working. – user2200660 Apr 03 '13 at 22:08
Got the solution: you can AND attribute conditions by chaining them as [][], whereas a comma is used for OR. – user2200660 Apr 03 '13 at 22:50
So if your question is solved, feel free to [accept this answer](http://stackoverflow.com/faq#howtoask). – ollo Apr 04 '13 at 13:39
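
To illustrate the selector combination mentioned in the comments above, here is a small sketch. The markup and the names `SelectorCombos`, `HTML` and `count` are invented for the example; the selector syntax itself (chained `[attr]` conditions for AND, a comma for OR) is standard jsoup.

```java
import org.jsoup.Jsoup;

public class SelectorCombos {

    // Invented sample markup, not taken from the original question.
    static final String HTML =
            "<blockquote cite=\"a\" class=\"q\">both attributes</blockquote>"
          + "<blockquote cite=\"b\">cite only</blockquote>"
          + "<p class=\"q\">different tag</p>";

    // Hypothetical helper: number of elements matched by a selector.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // AND: chained attribute conditions must all hold on the same element.
        System.out.println(count(HTML, "blockquote[cite][class]")); // prints 1

        // OR: a comma combines the results of independent selectors.
        System.out.println(count(HTML, "blockquote[cite], p.q"));   // prints 3
    }
}
```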

score 1 · Answer 2 · answered Apr 03 '13 at 19:12

Try this code :

// Read the whole file into one string (readLine() drops the line breaks).
StringBuilder builder = new StringBuilder();
try (BufferedReader br = new BufferedReader(new FileReader("e://XMLFile.xml"))) {
    String data;
    while ((data = br.readLine()) != null) {
        builder.append(data);
    }
}
System.out.println(builder);
// Lazy match: removes from the opening tag up to the first closing </div>.
String replaceAll = builder.toString().replaceAll("<div class=\"XYZ\".+?</div>", "");
System.out.println(replaceAll);

I have read the input XML from a file and stored it in a StringBuilder object by reading it line by line, and then replaced the entire tag with an empty string.
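
One caveat with this approach is worth seeing in a concrete run: the lazy `.+?` stops at the first `</div>` it meets, so a block that contains nested divs (as in the question) is only partly removed. The sketch below uses the same pattern on a shortened, made-up input; `RegexPitfall` and `stripWithRegex` are illustrative names.

```java
public class RegexPitfall {

    // Same pattern as in the answer above, applied to an in-memory string.
    static String stripWithRegex(String html) {
        return html.replaceAll("<div class=\"XYZ\".+?</div>", "");
    }

    public static void main(String[] args) {
        String html = "<div>before<div class=\"XYZ\">first"
                + "<div>waste</div>tail</div>after</div>";

        // The match ends at the inner </div>, so "tail</div>" is left behind
        // and the remaining markup is no longer balanced.
        System.out.println(stripWithRegex(html));
        // prints <div>beforetail</div>after</div>
    }
}
```

This is the practical reason an HTML parser such as jsoup is the safer tool here: it removes the element by its position in the tree rather than by text matching.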

answered Apr 03 '13 at 19:12

Ankur Shanbhag

You can easily use Jsoup for that. In general it's a better idea not to use regex but an HTML library (like jsoup). – ollo Apr 03 '13 at 19:14
@ollo : thanks for the information. I did not know about Jsoup. – Ankur Shanbhag Apr 03 '13 at 19:14
@ Ankur. Thanks, But I gave the solution using jsoup in the comment. Document doc = Jsoup.parse(html); doc.select("div.XYZ").first().remove(); return doc.body().outerHtml(); – user2200660 Apr 03 '13 at 21:10

score 1 · Answer 3 · answered Oct 19 '17 at 10:34

This may help you.

// Select a specific set of tags.
String selectTags = "div,li,p,ul,ol,span,table,tr,td,address,em";
Elements webContentElements = parsedDoc.select(selectTags);

// Remove some tags from within the selected elements.
String removeTags = "img,a,form";
webContentElements.select(removeTags).remove();

answered Oct 19 '17 at 10:34

Stephen

Is there any difference between this and the accepted answer? The corner case OP mentioned also seems to be covered. – Prajeeth Emanuel Aug 06 '19 at 11:57

score 0 · Answer 4 · answered Apr 05 '13 at 18:25

I asked this question yesterday and, thanks to ollo's answer, it was solved. There is an extension of the above problem. I did not know whether I should start a new post or continue this one, so I am continuing it here. Admins, please pardon me if I should have made a separate post for this.

In the above problem, I have to remove a tag block with a matching component.

The real scenario is: it should remove the tag block with the matching component and also remove the <br /> tags surrounding it.

Referring to the above example.

<body>
  start
  <div>
    delete from below
    <br />
    <br />
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    <br />
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

should also give the same output:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

because it has <br /> above and below the HTML tag block to remove.

Just to reiterate, I am using the solution given by ollo to match and remove the tag block.

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}

Thanks, Shekhar
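
The thread does not contain an answer to this follow-up. One possible approach, offered only as a sketch, is to walk the siblings on either side of each matched element and remove `<br>` elements until something other than whitespace is reached. The names `RemoveWithBreaks`, `removeAdjacentBreaks` and `clean` are made up, and the code assumes that "surrounding" means `<br>` tags separated from the block by nothing but whitespace.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class RemoveWithBreaks {

    // Removes <br> siblings next to 'start' in one direction, skipping
    // whitespace-only text nodes and stopping at any other node.
    static void removeAdjacentBreaks(Node start, boolean backwards) {
        Node cur = backwards ? start.previousSibling() : start.nextSibling();
        while (cur != null) {
            // Look up the following node first: a removed node has no siblings.
            Node next = backwards ? cur.previousSibling() : cur.nextSibling();
            if (cur instanceof Element && "br".equals(((Element) cur).tagName())) {
                cur.remove();
            } else if (!(cur instanceof TextNode && ((TextNode) cur).isBlank())) {
                break; // real content reached: stop here
            }
            cur = next;
        }
    }

    // Hypothetical helper: removes every match plus its adjacent <br> tags.
    static String clean(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.select(cssQuery)) {
            removeAdjacentBreaks(element, true);
            removeAdjacentBreaks(element, false);
            element.remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div>delete from below<br><br> <div class=\"XYZ\">first"
                + "<div class=\"XYZ\">second</div>waste</div> <br>delete till above</div>"
                + "<div>this will also remain</div>";
        System.out.println(clean(html, "div.XYZ"));
    }
}
```

A `<br>` that is separated from the block by real text is left alone, since the walk stops at the first non-blank text node or non-`<br>` element.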
