-3

Is there a good way to remove HTML elements that have the class "abc" from a Java string? A simple regex like

replaceAll("\\<.*?>","")

will remove all tags, but I want to remove only the tags that have the class "abc".

<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>

Remove only the H1 with class abc. Note: I have to do it with a regex, not a parser, because this is the only place in my code where I modify HTML, and I don't want an additional JAR in my project.

Vivek

3 Answers

-1

This should work:

replaceAll("<h1[^>]*?class=\"*\'*abc\"*\'*>.*?h1>","")
XesLoohc
  • This would also remove `hello world`. And what should `.*?` do?
    – Ctx Jan 07 '16 at 15:42
  • Yes, you are right, @Ctx. I just edited it; I think it will work now. – XesLoohc Jan 07 '16 at 15:57
  • `.*?` means "match any character as few times as possible, until the rest of the pattern matches". – XesLoohc Jan 07 '16 at 15:58
  • @XesLoohc - It's not working for me. I tried it with this HTML and it's not removing the span tag with class "landingPage".

    Skirts Landing H1Skirts Skirts SEO H1

    with the regex ]*?class=\"*\'*landingPage\"*\'*>.*?span>
    – Vivek Jan 07 '16 at 16:22
  • @Vivek It's your responsibility to provide input and expected output cases. You show a very simple case in your question and now counter an answer with complicated input that no one could have known about. – user1803551 Jan 07 '16 at 16:48
  • @Vivek

    Skirts Landing H1Skirts Skirts SEO H1

    Here the id is landingPage, but the regex is for class='landingPage'. ]*?id=\"*\'*landingPage\"*\'.*?>.*? I wrote it just for the specific case that you mentioned.
    – XesLoohc Jan 07 '16 at 18:48
  • I am still assuming there is no nested HTML in this tag. – XesLoohc Jan 07 '16 at 18:49
-1

Try

replaceAll("<[Hh]1 class=['\"]abc['\"]>.*?</[Hh]1>", "")

But note that since regex is not well-suited for this task, there might be unwanted results when it comes to complex HTML input.

For the input

<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>

the output is

<H1 class="xyz">Hello</H1>
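A slightly more tolerant variant of the same idea can be sketched in plain Java, with no extra JAR. This is only a sketch under stated assumptions: the class name `RemoveByClass` and the helper `removeH1WithClass` are hypothetical, the `class` attribute is assumed to hold exactly one class name, and the H1 is assumed to contain no nested H1. It uses a backreference so the closing quote must match the opening one, and it allows other attributes before or after `class`.

```java
import java.util.regex.Pattern;

class RemoveByClass {
    // Hypothetical helper: removes <h1 ...>...</h1> elements whose class
    // attribute is exactly `cls` (single or double quoted, any letter case
    // for the tag name). Not a general HTML parser.
    static String removeH1WithClass(String html, String cls) {
        Pattern p = Pattern.compile(
            "<h1\\b[^>]*\\sclass\\s*=\\s*(['\"])" + Pattern.quote(cls) + "\\1[^>]*>"
                + ".*?</h1\\s*>",
            // CASE_INSENSITIVE: match H1 and h1; DOTALL: let '.' span line breaks
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<H1 class=\"abc\">Hey</H1>\n<H1 class=\"xyz\">Hello</H1>";
        // The line break between the two elements is left in place.
        System.out.println(removeH1WithClass(html, "abc"));
    }
}
```

The lazy `.*?` stops at the first closing `</h1>`, so a neighbouring H1 with a different class is left alone. An attribute such as `class="abc def"` would not match in this sketch.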
user1803551
  • Not working; it returns the same string as output without removing the H1. – Vivek Jan 07 '16 at 16:57
  • @Vivek Works for me with the input you gave in the question. Show me how you implemented this line. – user1803551 Jan 07 '16 at 16:58
  • String html = "

    Hey

    Hello

    "; System.out.println("formatted string:" + html.replaceAll("<[Hh]1 class=\"landingPage\">.*[Hh]1>","")); This removes both H1 elements.
    – Vivek Jan 07 '16 at 17:04
  • @Vivek What is the syntax: `class ="name"` or `class='name'`? You are using both. – user1803551 Jan 07 '16 at 17:10
  • It can be both, but I would be happy with just class ="name" as well. I changed the input string to use double quotes, but the result is the same: both H1 elements are removed. String html = "

    Hey

    Hello

    ";
    – Vivek Jan 07 '16 at 17:13
  • @Vivek See my edited pattern. I made it work with both `"` and `'`. – user1803551 Jan 07 '16 at 17:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/100076/discussion-between-vivek-and-user1803551). – Vivek Jan 07 '16 at 17:32
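The "removes both H1" behaviour reported in the comments above is what a greedy `.*` would produce when both elements sit on one line. The input from those comments did not survive, so the sketch below is an assumption: it uses the two H1 elements from the question on a single line, with class "abc" as a stand-in, and the class and method names are hypothetical. It compares the greedy and the lazy quantifier on that input.

```java
class GreedyVsLazy {
    // Greedy: ".*" runs to the end of the line and backtracks to the LAST </H1>.
    static String removeGreedy(String html) {
        return html.replaceAll("<[Hh]1 class=\"abc\">.*</[Hh]1>", "");
    }

    // Lazy: ".*?" stops at the FIRST </H1> after the opening tag.
    static String removeLazy(String html) {
        return html.replaceAll("<[Hh]1 class=\"abc\">.*?</[Hh]1>", "");
    }

    public static void main(String[] args) {
        // Assumed input: both elements on one line, no line break between them.
        String html = "<H1 class=\"abc\">Hey</H1><H1 class=\"xyz\">Hello</H1>";
        System.out.println("greedy: [" + removeGreedy(html) + "]");
        System.out.println("lazy:   [" + removeLazy(html) + "]");
    }
}
```

On this input the greedy version removes everything, while the lazy version leaves the second H1 in place.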
-2

It's never a good idea to parse HTML using regex; see "RegEx match open tags except XHTML self-contained tags".

See "Which HTML Parser is the best?" for alternatives.

For example, using JSoup you could write something like this (untested):

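import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
Document doc = Jsoup.parse(html);          // parse the HTML string into a DOM
Elements elements = doc.select(".abc");    // select every element with class "abc"
elements.remove();                         // detach the selected elements from the document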
Matthias