0

I have a very long HTML string that contains multiple blocks like

    <dl id="divmap"> .... </dl>

I want to remove all the content between these tags.

I wrote this code in Java:

                String triphtml = htmlString;
                System.out.println("triphtml is "+triphtml);

                System.out.println("test1 ");
                final Pattern pattern = Pattern.compile("(<dl id=\""+selectedArray[i]+"\">)(.+?)(</dl>)",
                        Pattern.DOTALL);
                final Matcher matcher = pattern.matcher(triphtml);
                // matcher.find();
                System.out.println("pattern of test1 is : "
                        + pattern); // Prints
                System.out.println("MATCHER of test1 is : "
                        + matcher); // Prints
                System.out.println("MATCH COUNT of test1 a: "
                        + matcher.groupCount()); // Prints
                System.out.println("MATCH COUNT of test1  a: "
                        + matcher.find()); // Prints
                while (matcher.find()) {
                    // System.out.println("MATCH GP 3: "+matcher.group(3).substring(1,10));

                    for (int z = 0; z <= matcher.groupCount(); z++) {
                        String extstr = matcher.group(z);
                        System.out.println("matcher group of "+z+" test1  is " + extstr);
                        System.out.println("ext a of test1  is " + extstr);
                        triphtml = triphtml.replaceAll(extstr, "");
                        System.out.println("Group found of test1 is :\n" + extstr);
                    }

                }

But this code removes some of the dl blocks while others remain in triphtml. I don't know why this is happening. Here triphtml is an HTML string that contains multiple dl blocks. Please help me remove the content between all

    <dl id="divmap">.

Thanks in advance.

user2727837
  • I think [this answer](http://stackoverflow.com/a/1732454/2071828) might be helpful here. Yes, _that_ one. – Boris the Spider Dec 12 '13 at 09:12
  • I used [HTML Cleaner](http://htmlcleaner.sourceforge.net) a year ago and it worked well for me. – Anugoonj Dec 12 '13 at 09:21
  • I can't use HTML Cleaner because it removes all the HTML, but I only want to remove a certain part of the HTML string. And I can't use an HTML parsing library because the HTML contains a lot of errors. – user2727837 Dec 12 '13 at 09:24

3 Answers

1

I suggest NOT using regex for HTML. Just use any library built for traversing XML/HTML.

For example, JSoup.

Marcin Szymczak
0

Using regex, you can do it as follows:

    String orgString = "<dl id=\"divmap\"> .... </dl>";

    orgString = orgString.replaceAll("<[^>]*>", "");
    // removes the HTML tags

    orgString = orgString.replaceAll(orgString.replaceAll("<[^>]*>", ""), "");
    // removes the content inside the HTML tags

But it is better to use an HTML parser.

Edit:

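    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String htmlString = "<dl id=\"divmap\"> Content </dl>";
    Pattern p = Pattern.compile("<[^>]*>"); // matches any single HTML tag
    Matcher m = p.matcher(htmlString);
    while (m.find()) {
        // replace() treats the matched tag as literal text, not as a regex
        htmlString = htmlString.replace(m.group(), "");
    }
    System.out.println("Ans" + htmlString);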
Rakesh KR
  • Maybe this is just plain Java code and you are not using any regex here. Am I right? – user2727837 Dec 12 '13 at 09:27
  • @user2727837 **No**, `<[^>]*>` is a regex for finding the HTML tag :-) – Rakesh KR Dec 12 '13 at 09:34
  • Sorry, you are right. It works in your code, but not in mine. You can check for yourself; here is my HTML string: "http://pastebin.com/iqiWCGw7". – user2727837 Dec 12 '13 at 10:33
  • @user2727837 The link you shared contains nothing... _This paste has been removed!_ – Rakesh KR Dec 12 '13 at 10:42
  • I think the problem is in how the content is stored in the String. You need to store it as shown in http://pastebin.com/i7d6Gq6h – Rakesh KR Dec 12 '13 at 11:02
  • I can't do what you suggested, because I am generating this HTML string from code. It's just like getting an HTML file as a string. – user2727837 Dec 12 '13 at 11:07
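
One possible reason the same regex works in a small test but not in the question's code is visible in the question itself. The debug `println` there calls `matcher.find()` once before the `while` loop, which consumes the first match, and `replaceAll` treats the matched text as a regex rather than as literal text. The sketch below illustrates both effects on a made-up sample string; the class name `MatcherPitfalls` and the helper `countInLoop` are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherPitfalls {

    // Counts the matches seen by the while loop. When callFindFirst is true,
    // find() is called once beforehand, as the question's debug println does.
    static int countInLoop(String html, boolean callFindFirst) {
        Matcher m = Pattern
                .compile("(<dl id=\"divmap\">)(.+?)(</dl>)", Pattern.DOTALL)
                .matcher(html);
        if (callFindFirst) {
            m.find(); // consumes the first match, so the loop never sees it
        }
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample input with two blocks.
        String html = "<dl id=\"divmap\">a (1)</dl><p>keep</p><dl id=\"divmap\">b</dl>";

        System.out.println(countInLoop(html, false)); // 2
        System.out.println(countInLoop(html, true));  // 1

        String block = "<dl id=\"divmap\">a (1)</dl>";
        // replaceAll reads "(1)" as a capturing group that matches "1",
        // so the literal text "a (1)" is not found and nothing is removed.
        System.out.println(html.replaceAll(block, "").equals(html)); // true
        // replace treats its first argument as literal text.
        System.out.println(html.replace(block, ""));
    }
}
```

If the real HTML contains characters such as `(`, `?`, `+` or `[` inside a block, the second effect alone would be enough to leave that block in place.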
0

Try using JSoup.

It uses selectors with a jQuery-like syntax, and it is very easy to use.

You can try this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    String triphtml = htmlString;

    Document doc = Jsoup.parse(htmlString);
    Elements divmaps = doc.select("#divmap");

Then you can remove (or alter) the elements in the DOM:

    divmaps.remove();
    triphtml = doc.html();
vzamanillo
  • Great, but this takes too much time. All my code would have to be modified, and I don't have much time to write new code. Will you please help me find the mistake in my code? – user2727837 Dec 12 '13 at 09:31
  • Do all your dl elements have the same id "divmap"? If so, change the pattern to `final Pattern pattern = Pattern.compile("(<dl id=\"divmap\">)(.+?)(</dl>)", Pattern.DOTALL);` and then, inside the `while (matcher.find())` loop, take the String extstr from `matcher.group(2)` and replace it: `String extstr = matcher.group(2); triphtml = triphtml.replaceAll(extstr, "")` – vzamanillo Dec 12 '13 at 09:50
  • Yes, all my dl elements have the same id "divmap". But my pattern is already like yours. The only difference is that I get the id from an array (variable). Everything else is the same as you said (already described in my question), but I am not getting the correct result. – user2727837 Dec 12 '13 at 09:54
  • You have to match and replace only the group with index 2, matcher.group(2). Look at this sample: http://pastebin.com/3RmNC4vs – vzamanillo Dec 12 '13 at 09:59
  • But do you want to remove the complete tag or only the content of the tag? – vzamanillo Dec 12 '13 at 10:23
  • Yes, it works in this code. But when I implement it in my own code, it does not work. I don't know why. – user2727837 Dec 12 '13 at 10:26
  • Sorry, it works in your code but not in mine. You can check for yourself; here is my HTML string: http://pastebin.com/iqiWCGw7 – user2727837 Dec 12 '13 at 10:43
  • Wow, today I implemented the jsoup library and it works like a charm, but the problem is that it changes my document type from w3org to jsoup. And if my document is in another language such as Chinese, all the contents get modified. – user2727837 Dec 13 '13 at 04:07
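
For readers who want to stay with regex as discussed in the comments above, the per-group loop can be replaced by a single `Matcher.replaceAll` call. This is only a sketch under stated assumptions: the blocks are not nested, every opening tag is written exactly as `<dl id="ID">`, and the tags themselves should stay while their content is removed. The class `DlContentStripper` and the method `stripDlContent` are hypothetical names, not part of any library.

```java
import java.util.regex.Pattern;

class DlContentStripper {

    // Hypothetical helper: empties every <dl id="ID">...</dl> block but keeps the tags.
    // Assumes the blocks are not nested and the opening tag is exactly <dl id="ID">.
    static String stripDlContent(String html, String id) {
        Pattern pattern = Pattern.compile(
                // Pattern.quote makes the opening tag (including the id) literal text.
                "(" + Pattern.quote("<dl id=\"" + id + "\">") + ").*?(</dl>)",
                Pattern.DOTALL); // lets '.' match line breaks inside a block
        // $1 and $2 put the opening and closing tags back;
        // replacing with "" instead would drop the whole element.
        return pattern.matcher(html).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        // Made-up sample input with two blocks, one spanning a line break.
        String html = "<dl id=\"divmap\">a (1)\n</dl><p>keep</p><dl id=\"divmap\">b</dl>";
        System.out.println(stripDlContent(html, "divmap"));
    }
}
```

Because the replacement happens in one pass over the original string, no match is skipped and the matched text is never reinterpreted as a regex. For malformed or nested markup a parser such as JSoup remains the more robust choice, as the answers point out.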