
Let's say I have this HTML code in a String variable:

String htmlCode = "<span class='test'>test</span>"
        + "<a href=\"http://foo.com?id=<span class='test'>test</span>\">link</a>";

The htmlCode variable would contain more links like that one, as well as more spans like that one.

I want to replace everything between <span and </span>, including the span tags themselves, but only when they occur inside an <a href tag. That is, the first span should stay untouched and the second one should be replaced.

I know that regex can do that, but so far I have only managed this:

htmlCode = htmlCode.replaceAll("<span.*?</span>", "");

But how do I specify that the replacement should happen only inside the <a> tag? And is there a way to make the replacement include the span tags themselves?

Ondrej Tokar
  • You should use an HTML parser for this purpose, like [JSoup](http://jsoup.org/). You could use the `a>span` selector and remove all the returned nodes. – BackSlash Aug 03 '15 at 13:45
  • Actually, I am using JSoup and I will use it. I was wondering, though, how you would do it with a plain string, since it wouldn't have to be HTML, right? But thank you. – Ondrej Tokar Aug 03 '15 at 13:47
  • @BackSlash You should post that as answer. – Pshemo Aug 03 '15 at 13:48
  • Someone may (or may not) be able to design such a regex. Then you will realize it doesn't apply to nested `span` and/or nested `a` tags... – dotvav Aug 03 '15 at 13:49
  • @Pshemo We have all been fooled :) Look at the code, the span is **not** a *child* of the `a` tag :) JSoup will do nothing here with the `a>span` selector – BackSlash Aug 03 '15 at 13:53
  • @BackSlash `href=\"http://foo.com?id=<span class='test'>test</span>\"` It doesn't make sense. Who does that? – Pshemo Aug 03 '15 at 13:54
  • @Pshemo We do it for a merge field in our system; that span value is then replaced by a string value. However, when it is in a PDF file it is not replaced and the link doesn't work. Therefore I need to remove the span tags from the link. – Ondrej Tokar Aug 03 '15 at 13:56
  • @Pshemo It seems they do. Although I find this placeholding system a bit weird... – BackSlash Aug 03 '15 at 14:02
  • @backslash it is an Eloqua system of Oracle... pretty widely used. Wonders every day huh? Any idea for a solution? – Ondrej Tokar Aug 03 '15 at 14:04
  • @OndrejTokar Please, see if my answer can help – BackSlash Aug 03 '15 at 14:09

1 Answer


If I understand your question correctly, you want to remove the span tags from the href value of your a tag.

In that case, you can try something like this:

String htmlCode = "<span class='test'>test</span>"
        + "<a href=\"http://foo.com?id=<span class='test'>test</span>\">link</a>"
        + "<a href=\"http://foo.com?id=test2\">link</a>";
// Requires imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element
Document doc = Jsoup.parse(htmlCode);
System.out.println(doc); // document before the change

// Select every <a> whose href attribute contains "<span"
for (Element el : doc.select("a[href*=<span]")) {
    // Parse the href value itself as HTML and keep only its text, so
    // "http://foo.com?id=<span class='test'>test</span>" becomes "http://foo.com?id=test"
    el.attr("href", Jsoup.parse(el.attr("href")).text());
}

System.out.println("-----");
System.out.println(doc);

Output (before/after):

<html>
 <head></head>
 <body>
  <span class="test">test</span>
  <a href="http://foo.com?id=<span class='test'>test</span>">link</a>
  <a href="http://foo.com?id=test2">link</a>
 </body>
</html>
-----
<html>
 <head></head>
 <body>
  <span class="test">test</span>
  <a href="http://foo.com?id=test">link</a>
  <a href="http://foo.com?id=test2">link</a>
 </body>
</html>
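
For the string-only variant raised in the question and comments, the same idea can be sketched with java.util.regex alone. This is a rough illustration rather than a robust HTML solution: it assumes every href value is wrapped in double quotes and contains no double quote itself (as in the example, where the span uses single quotes). The class name HrefSpanStripper and the method name stripSpansInHrefs are made up for this sketch. Like the jsoup version above, it drops the span tags but keeps the text between them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefSpanStripper {
    // Group 1: the `href="` prefix, group 2: the attribute value, group 3: the closing quote.
    // Assumption: the value is double-quoted and contains no double quote itself.
    private static final Pattern HREF = Pattern.compile("(href\\s*=\\s*\")([^\"]*)(\")");

    // An opening or closing span tag, with any attributes.
    private static final Pattern SPAN_TAG = Pattern.compile("</?span\\b[^>]*>");

    /** Removes span tags (keeping their inner text) only inside href attribute values. */
    static String stripSpansInHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Clean only the attribute value; text outside href values is never touched.
            String cleaned = SPAN_TAG.matcher(m.group(2)).replaceAll("");
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + cleaned + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String htmlCode = "<span class='test'>test</span>"
                + "<a href=\"http://foo.com?id=<span class='test'>test</span>\">link</a>";
        System.out.println(stripSpansInHrefs(htmlCode));
    }
}
```

The two-step approach (find the href value first, then clean it) is what limits the replacement to links: the first span in the example sits outside any href and is left alone. As noted in the comments, a regex like this will not cope with arbitrary nesting or unusual quoting, so a parser remains the safer choice for real HTML.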
Pshemo
  • Thank you, but I don't understand how `Jsoup.parse(el.attr("href")).text()` will remove the `span` tag. – Ondrej Tokar Aug 03 '15 at 14:17
  • @OndrejTokar The `text()` method strips out tags from the parsed string, so it will remove `<span class='test'>` and `</span>` – BackSlash Aug 03 '15 at 14:21
  • @OndrejTokar `Jsoup.parse` returns a `Document`, which extends the `Element` class and inherits its `text()` method. You can find its documentation here: http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#text-- In short, it returns only the text generated from the stored HTML structure, so markup wrapped around `hello` and `world` becomes plain `helloworld`. It kind of simulates the text you would see in a browser. – Pshemo Aug 03 '15 at 14:21
  • I have used the other solution, but since it isn't here anymore, I assume you made some kind of quiet agreement, so I will accept your answer. – Ondrej Tokar Aug 03 '15 at 14:26
  • Is my question so bad that it has a negative score? If you think it isn't that bad, I would be grateful for a +1 so that it isn't at -1 ;/. Thanks – Ondrej Tokar Aug 03 '15 at 14:27
  • @OndrejTokar I didn't do anything; @BackSlash probably removed his solution since you claimed that it didn't work correctly. Later it was updated, but he probably thinks that `Jsoup.parse(..).text()` is better than `replaceAll`, so he decided to remove his answer. But that is only my suspicion. – Pshemo Aug 03 '15 at 14:27
  • @OndrejTokar I removed my answer, because it used regexes to parse the `span` elements in the `href`. I think this answer is better – BackSlash Aug 03 '15 at 14:29
  • @OndrejTokar Your question is not bad, but people often get on tilt when they see someone trying to combine `HTML + regex` and they downvote automatically. I often agree with them, because of http://stackoverflow.com/q/701166/1393766 which is also the reason for this answer http://stackoverflow.com/a/1732454/1393766 – Pshemo Aug 03 '15 at 14:32
  • @Pshemo Thank you for the explanation. It is a shame people are so negative without looking into it more closely ... ;) – Ondrej Tokar Aug 04 '15 at 06:43