7

I want to remove a tag together with everything inside it. An example input may be

Input:

<body>
  start
  <div>
    delete from below
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

The output will be:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

Basically, I have to remove the entire block for the first occurrence of <div class="XYZ">, including everything nested inside it.

Thanks,

user2200660
  • What have you done so far? – ollo Apr 03 '13 at 19:07
  • I found the answer in the Jsoup selector. The solution is something like: `Document doc = Jsoup.parse(html); doc.select("div.XYZ").first().remove(); return doc.body().outerHtml();` There is one problem, though: this gives the correct result when the HTML string contains `<div class="XYZ">`, but it throws `java.lang.NullPointerException` when `<div class="XYZ">` is not present in the input HTML string. Do I need to check first and do the removal only if I find a div of that type? Thanks.
    – user2200660 Apr 03 '13 at 19:07
  • I cannot answer my own question??? awwww – user2200660 Apr 03 '13 at 19:08
  • No, you can't, but you can post your solution as an *answer* and then accept it (see: http://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/). If your question is solved, please do it that way. – ollo Apr 03 '13 at 19:10

4 Answers

16

You'd better iterate over all elements found, so you can be sure that

  • a.) all elements are removed and
  • b.) there's nothing done if there's no element.

Example:

Document doc = ...

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}

Edit:

( An addition to my comment )

Don't use exception handling when a simple null- / range check is enough here:

doc.select("div.XYZ").first().remove();

instead:

Elements divs = doc.select("div.XYZ");

if( !divs.isEmpty() )
{
    /*
     * Here it's safe to call 'first()' since there is at least one element.
     */
    divs.first().remove();
}
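
The check above can be put together into a small runnable sketch. The class name `RemoveFirstMatch` and the helper `removeFirstXyz` are made up for illustration, and the code assumes jsoup is on the classpath. It relies on `Elements.first()` returning `null` for an empty selection, so it tests the result instead of catching an exception:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RemoveFirstMatch {

    // Hypothetical helper: removes the first div.XYZ (with everything nested
    // inside it) and returns the body HTML. Nothing is removed if no element matches.
    static String removeFirstXyz(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select("div.XYZ").first(); // null when the selection is empty
        if (first != null) {
            first.remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div>keep<div class=\"XYZ\">drop<div class=\"XYZ\">nested</div></div></div>";
        System.out.println(removeFirstXyz(html));
        System.out.println(removeFirstXyz("<p>no match here</p>"));
    }
}
```

Because removing an element also removes its children, the nested second div.XYZ disappears together with the first one.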
ollo
  • Thanks ollo... I did it with `doc.select("div.XYZ").first().remove();` and kept this in a try block; if an exception is caught (meaning the required element is absent), it returns the original string. That way it was solved. But yours is a better way of doing this. – user2200660 Apr 03 '13 at 21:15
  • 2
    *Don't* use exception handling for this - a simple *null check* is much better. – ollo Apr 03 '13 at 21:24
  • Thanks once again @ollo. Your solution also solved another problem of mine, where I have to check an HTML tag for a particular own text and remove it. I did not know that you can remove an element while iterating over the selection in a for-each loop, and this is coming in handy. Thanks. – user2200660 Apr 03 '13 at 21:24
  • Please see my edit. By the way, you're right; using a loop is better since the loop is skipped if there are no elements (`first()` returns `null` for an empty selection, so `first().remove()` throws a `NullPointerException` instead). – ollo Apr 03 '13 at 21:29
  • It's possible to use a selector which selects only those elements which have an own text (or have none). – ollo Apr 03 '13 at 21:32
  • Can you match two attributes on one element? For example, I have to match
    ... For this I am doing `doc.select("blockquote[cite~="+regex+"]");` where `regex = "\".*?\""`, but this is not working.
    – user2200660 Apr 03 '13 at 22:08
  • 2
    Got the solution: you can AND the attribute conditions using `[][]`, while a comma is used for OR. – user2200660 Apr 03 '13 at 22:50
  • So if your question is solved, feel free to [accept this answer](http://stackoverflow.com/faq#howtoask). – ollo Apr 04 '13 at 13:39
1

Try this code :

String data = null;
// Read the whole file into one string; readLine() drops the line breaks.
BufferedReader br = new BufferedReader(new FileReader("e://XMLFile.xml"));
StringBuilder builder = new StringBuilder();
while ((data = br.readLine()) != null) {
    builder.append(data);
}
br.close();
System.out.println(builder);
// Remove everything from <div class="XYZ" up to the nearest </div>.
String replaceAll = builder.toString().replaceAll("<div class=\"XYZ\".+?</div>", "");
System.out.println(replaceAll);

I have read the input XML from a file, stored it in a StringBuilder object by reading it line by line, and then replaced the entire tag with an empty string.
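
One caveat, shown with a small sketch that uses only the standard library (the class name `RegexNestedDivDemo` and the sample string are made up for illustration): the non-greedy `.+?` stops at the nearest `</div>`, so with nested divs, as in the question's input, the match ends at the inner closing tag and leaves the rest of the block behind.

```java
public class RegexNestedDivDemo {

    // Same pattern as above: non-greedy match up to the nearest </div>.
    static String stripWithRegex(String html) {
        return html.replaceAll("<div class=\"XYZ\".+?</div>", "");
    }

    public static void main(String[] args) {
        // The div.XYZ block contains a nested <div>, like the question's input.
        String nested = "<div>keep <div class=\"XYZ\">first <div>waste</div> tail</div> end</div>";
        // The match ends at the inner </div>, so " tail</div>" is left over
        // and the remaining markup is no longer balanced.
        System.out.println(stripWithRegex(nested));
    }
}
```

Note also that `.` does not match line terminators by default, so the pattern only spans the block here because the lines were concatenated without their line breaks.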

Ankur Shanbhag
  • 2
    You can easily use Jsoup for that. In general it's a better idea not to use regex but an HTML library (like jsoup). – ollo Apr 03 '13 at 19:14
  • @ollo : thanks for the information. I did not know about Jsoup. – Ankur Shanbhag Apr 03 '13 at 19:14
  • @Ankur Thanks, but I gave the solution using jsoup in the comment: `Document doc = Jsoup.parse(html); doc.select("div.XYZ").first().remove(); return doc.body().outerHtml();` – user2200660 Apr 03 '13 at 21:10
1

This may help you.

 String selectTags="div,li,p,ul,ol,span,table,tr,td,address,em";
 /*selecting some specific tags */
 Elements webContentElements = parsedDoc.select(selectTags); 
 String removeTags = "img,a,form"; 
 /*Removing some tags from selected elements*/
 webContentElements.select(removeTags).remove();
Stephen
0

I asked this question yesterday and, thanks to ollo's answer, it was solved. There is an extension of the above problem. I did not know whether to start a new post or continue this one, so I am adding it here. Admins, please pardon me if I should have made a separate post for this.

In the above problem, I have to remove a tag block with matching component.

The real scenario is: It should remove the tag block with matching component + remove <br /> surrounding it.

Referring to the above example.

<body>
  start
  <div>
    delete from below
    <br />
    <br />
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    <br />
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

should also give the same output:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

Because it has <br /> above and below the HTML tag block to remove.

Just to re-iterate, I am using the solution given by ollo to match and remove the tag block.

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}
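
One way this loop might be extended to also drop the adjacent `<br />` elements is sketched below. It is only a sketch under assumptions: jsoup is on the classpath, "surrounding" means `<br>` siblings directly before or after the block with nothing but whitespace in between, and the class and helper names (`RemoveWithSurroundingBr`, `removeAdjacentBr`, `removeXyzAndBr`) are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class RemoveWithSurroundingBr {

    // True for a <br> element.
    static boolean isBr(Node n) {
        return n instanceof Element && ((Element) n).tagName().equals("br");
    }

    // True for a text node that holds only whitespace.
    static boolean isBlankText(Node n) {
        return n instanceof TextNode && ((TextNode) n).isBlank();
    }

    // Walks the siblings before (or after) 'start' and removes each <br>,
    // skipping whitespace, until any other node is reached.
    static void removeAdjacentBr(Node start, boolean forward) {
        Node n = forward ? start.nextSibling() : start.previousSibling();
        while (n != null && (isBr(n) || isBlankText(n))) {
            Node next = forward ? n.nextSibling() : n.previousSibling();
            if (isBr(n)) {
                n.remove();
            }
            n = next;
        }
    }

    // Hypothetical helper: removes every div.XYZ block plus its adjacent <br> elements.
    static String removeXyzAndBr(String html) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.select("div.XYZ")) {
            removeAdjacentBr(element, false); // <br> elements above the block
            removeAdjacentBr(element, true);  // <br> elements below the block
            element.remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div>delete from below<br><br> <div class=\"XYZ\">first"
                + "<div>waste</div></div> <br>delete till above</div>";
        System.out.println(removeXyzAndBr(html));
    }
}
```

A `<br>` that is separated from the block by non-blank text is left alone, since the walk stops at the first node that is neither a `<br>` nor whitespace.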

Thanks, Shekhar

user2200660