Remove HTML elements that have class x from a string in Java

Question

Is there a good way to remove HTML elements that have class "abc" from a Java string? A simple regex like

replaceAll("\\<.*?>","")

will remove all tags, but I want to remove only those tags that have class "abc".

<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>

Remove the h1 with class abc only. Note: I have to do it with a regex, not with a parser, because this is the only place where I modify HTML in my code and I don't want an additional JAR.

Just to confirm: I don't want any parser; I have to do it with a regex. — Vivek, Jan 07 '16 at 15:19
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — BCartolo, Jan 07 '16 at 15:26
Do you want to remove only the tags, or also the text between them? — user1803551, Jan 07 '16 at 16:16
@user1803551 - The tag and the text as well. No h1 tag with class "abc", or the text inside it, should remain. — Vivek, Jan 07 '16 at 16:30

XesLoohc · Answer 1 · 2016-01-07T15:56:44.680

-1

This should work:

replaceAll("<h1[^>]*?class=\"*\'*abc\"*\'*>.*?h1>","")
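
As a rough illustration of how a pattern like the one above could be made more tolerant, here is a minimal sketch. The class `RemoveByClass` and the helper `removeH1WithClass` are hypothetical names, not part of the answer. It assumes there are no nested `<h1>` elements, that attribute values contain no `>` and that the class attribute is quoted. It matches the tag name case-insensitively, accepts other attributes before or after `class`, and treats the class name as a whole token, so "abcd" is not mistaken for "abc".

```java
import java.util.regex.Pattern;

public class RemoveByClass {
    // Hypothetical helper: removes every <h1>...</h1> whose quoted class
    // attribute contains cssClass as a whole token. Assumes no nested <h1>
    // and no '>' inside attribute values; a regex cannot handle arbitrary HTML.
    static String removeH1WithClass(String html, String cssClass) {
        String regex =
            "(?is)<h1\\b[^>]*?\\sclass\\s*=\\s*([\"'])"       // opening tag up to the quote of class=
            + "[^\"']*?(?<![\\w-])"                            // other class names before the target
            + "(?-i:" + Pattern.quote(cssClass) + ")"          // the class name itself, case-sensitive
            + "(?![\\w-])[^\"']*\\1"                           // other class names after it, same closing quote
            + "[^>]*>.*?</h1\\s*>";                            // rest of the tag, lazy content, closing tag
        return Pattern.compile(regex).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<H1 class=\"abc\">Hey</H1>\n<H1 class=\"xyz\">Hello</H1>";
        System.out.println(removeH1WithClass(html, "abc"));
    }
}
```

The `(?s)` flag lets `.` match line breaks, so an element whose text spans several lines is still removed as a whole.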

edited Jan 07 '16 at 15:56

answered Jan 07 '16 at 15:32


This would also remove `hello world`. And what should `.*?` do? – Ctx Jan 07 '16 at 15:42
Yes, you are right @Ctx. I just edited it; I think it will work now. – XesLoohc Jan 07 '16 at 15:57
`.*?` is "match anything, as few times as possible, until the rest of the pattern matches". – XesLoohc Jan 07 '16 at 15:58
@XesLoohc - It's not working for me. I tried it with this HTML, `Skirts Landing H1Skirts Skirts SEO H1`, and it's not removing the span tag with class "landingPage". The regex was `]*?class=\"*\'*landingPage\"*\'*>.*?span>`. – Vivek Jan 07 '16 at 16:22
@Vivek It's your responsibility to provide input and expected output cases. You show a very simple case in your question and now counter an answer with complicated input that no one could have known about. – user1803551 Jan 07 '16 at 16:48
@Vivek In `Skirts Landing H1Skirts Skirts SEO H1` the attribute is id = landingPage, while the regex is for class='landingPage': `]*?id=\"*\'*landingPage\"*\'.*?>.*?` I wrote it just for the specific case that you mentioned. – XesLoohc Jan 07 '16 at 18:48
I am still assuming there is no nested HTML in this tag. – XesLoohc Jan 07 '16 at 18:49

user1803551 · Answer 2 · 2016-01-07T17:18:08.343

-1

Try

replaceAll("<[Hh]1 class=['\"]landingPage['\"]>.*?</[Hh]1>", "")

But note that since regex is not well-suited for this task, there might be unwanted results when it comes to complex HTML input.

For the input

<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>

the output is

<H1 class="xyz">Hello</H1>
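
To make the role of the lazy quantifier concrete, here is a small sketch that runs the pattern above with the class name from the question ("abc") substituted for "landingPage". The class `LazyVsGreedy` and its `strip` helper are made up for illustration. It compares the lazy `.*?` with a greedy `.*`, and shows that `.` does not cross line breaks unless the `(?s)` flag is set.

```java
public class LazyVsGreedy {
    // Pattern from the answer, with the question's class name "abc" substituted.
    static final String LAZY = "<[Hh]1 class=['\"]abc['\"]>.*?</[Hh]1>";
    // Same pattern with a greedy quantifier: it runs to the LAST closing tag.
    static final String GREEDY = "<[Hh]1 class=['\"]abc['\"]>.*</[Hh]1>";

    static String strip(String html, String regex) {
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Both elements on one line, so '.' can reach the second closing tag.
        String html = "<H1 class=\"abc\">Hey</H1><H1 class=\"xyz\">Hello</H1>";
        System.out.println(strip(html, LAZY));             // only the "xyz" element is left
        System.out.println(strip(html, GREEDY).isEmpty()); // true: both elements were removed

        // Without (?s), '.' stops at a line break and nothing matches.
        String multiLine = "<H1 class=\"abc\">Hey\nthere</H1>";
        System.out.println(strip(multiLine, LAZY).equals(multiLine));   // true: unchanged
        System.out.println(strip(multiLine, "(?s)" + LAZY).isEmpty());  // true: removed
    }
}
```

A greedy `.*` that swallows everything up to the last closing tag is one plausible reason for a pattern removing both H1 elements when they sit on the same line.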

edited Jan 07 '16 at 17:18

answered Jan 07 '16 at 16:45


Not working; it returns the same string as output without removing the H1. – Vivek Jan 07 '16 at 16:57
@Vivek Works for me with the input you gave in the question. Show me how you implemented this line. – user1803551 Jan 07 '16 at 16:58
`String html = " Hey Hello "; System.out.println("formatted string:" + html.replaceAll("<[Hh]1 class=\"landingPage\">.*[Hh]1>",""));` This is removing both H1s. – Vivek Jan 07 '16 at 17:04
@Vivek What is the syntax: `class ="name"` or `class='name'`? You are using both. – user1803551 Jan 07 '16 at 17:10
It can be both, but I will be happy with just `class ="name"` as well. I changed the input string to use double quotes, but the result is the same: both H1s are removed. `String html = " Hey Hello ";` – Vivek Jan 07 '16 at 17:13
@Vivek See my edited pattern. I made it to work with both `"` and `'`. – user1803551 Jan 07 '16 at 17:18
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/100076/discussion-between-vivek-and-user1803551). – Vivek Jan 07 '16 at 17:32

Matthias · Answer 3 · 2016-01-07T15:28:36.017

-2

It's never a good idea to parse HTML using regex, see RegEx match open tags except XHTML self-contained tags

See Which HTML Parser is the best? for alternatives.

For example, using JSoup you could write something like this (untested):

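Document doc = Jsoup.parse(html);        // parse the string into a DOM (org.jsoup.Jsoup)
Elements elements = doc.select(".abc");  // every element with class "abc" (org.jsoup.select.Elements)
elements.remove();                       // detach them, including the text inside
String result = doc.body().html();       // serialize what is left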

edited Jan 07 '16 at 15:28

answered Jan 07 '16 at 15:14


Don't add only links of other SO questions, if you think this is dupe close it as dupe. – Tushar Jan 07 '16 at 15:15
I wouldn't consider this a duplicate, so I added an example. – Matthias Jan 07 '16 at 15:29
