Java: remove links to a certain website from HTML using regexp

Question

I need to remove all links to a certain website, for example http://my-domain.com, from an HTML string. I know how to do it using Jsoup, but I don't want to parse the HTML; I think what I want can be achieved using a regexp.

For example, I have the string:

<p> Hello</p> <a href="http://my-domain"> My site</a> and <a href="http://google.com"> Google </a>

After the replacement, my string should look like:

<p> Hello</p> and <a href="http://google.com"> Google </a>

Can you please help me with a regexp to achieve this result?

"I know how to do it using Jsoup, but i don't want to parse html, i think that what i want can be reached using regexp." But why do you want to torture yourself with regexp? A regex solution can fall into many traps that HTML parsers already avoid. – Pshemo, Jul 10 '18 at 08:14
Beyond that: when regular expressions are your preferred solution, why do you need to rely on *other* people to build them? Meaning: it is your code, your project. You will be the one on the spot to fix bugs or enhance features. But then you need to talk to other people to create the regular expressions for you? That doesn't sound like a sustainable plan to me ... use appropriate tools to solve problems, and ensure that *you* master these tools... – GhostCat, Jul 10 '18 at 08:16
As @Pshemo says, learn to do it the appropriate way. If you still need more convincing: https://stackoverflow.com/a/1732454/2545439 – Pieter De Bie, Jul 10 '18 at 08:18
Related: [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/q/701166) – Pshemo, Jul 10 '18 at 08:28
Thanks to all. Yes, I agree that I need to learn regexp more. – Darthoo, Jul 10 '18 at 08:33

score 1 · Accepted Answer · answered Jul 10 '18 at 08:20

    String html = "<p> Hello</p> <a href=\"http://my-domain\"> My site</a> and <a href=\"http://google.com\"> Google </a>";
    // The reluctant .*? stops at the first </a>, so only the my-domain anchor is removed.
    System.out.println(html.replaceAll("<a href=\"http://my-domain\">.*?</a>", ""));
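
The question mentions `http://my-domain.com`, where an unescaped `.` in the pattern would match any character. A minimal self-contained sketch of the same idea, assuming the anchors are always plain `<a href="URL">...</a>` with an exact URL; the class name `RemoveLinks` and the helper `removeLinksTo` are hypothetical and not part of the original answer:

```java
import java.util.regex.Pattern;

public class RemoveLinks {
    // Hypothetical helper: removes plain <a href="URL">...</a> anchors
    // whose href is exactly the given url.
    static String removeLinksTo(String html, String url) {
        // Pattern.quote treats the URL as a literal, so the "." in
        // "my-domain.com" matches only a dot.
        String regex = "<a href=\"" + Pattern.quote(url) + "\">.*?</a>";
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "<p> Hello</p> <a href=\"http://my-domain.com\"> My site</a>"
                + " and <a href=\"http://google.com\"> Google </a>";
        System.out.println(removeLinksTo(html, "http://my-domain.com"));
    }
}
```

Only the anchor itself is removed, so the spaces on both sides of it remain and the result contains a double space before "and". Since `.` does not match line breaks by default, an anchor whose text spans several lines would be left untouched by this sketch.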

Ralf Renz

And now let's say that the HTML contains links with a `title` attribute `...`. Or that attribute values are set using `'` instead of `"`. Or that there is JavaScript which includes a string containing `...` which shouldn't be modified because it doesn't belong to the DOM. There are plenty of problems already solved by HTML parsers which a regex can easily miss... – Pshemo Jul 10 '18 at 08:27

@Pshemo Yes, you are right. But in my case I'll always have a plain anchor, which is why I want to use regexp instead of parsing the HTML. If the logic becomes more complex, for example the cases you describe, I'll use a parser like jsoup instead of regexp. – Darthoo Jul 10 '18 at 08:32
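
To make the cases raised in the comments concrete, here is a rough sketch of how the pattern grows once single quotes, extra attributes such as `title`, and paths after the site URL have to be tolerated. It is an illustration only, not a replacement for a parser: the class `TolerantRemoveLinks` and the helper `removeLinksTo` are hypothetical names, and the pattern still cannot tell a real anchor from one that appears inside a JavaScript string.

```java
import java.util.regex.Pattern;

public class TolerantRemoveLinks {
    // Hypothetical helper: removes <a> elements whose href starts with siteUrl,
    // allowing ' or " quotes and other attributes around href.
    static String removeLinksTo(String html, String siteUrl) {
        Pattern anchor = Pattern.compile(
                "<a\\b[^>]*\\shref\\s*=\\s*([\"'])"  // attributes before href, then an opening quote
                + Pattern.quote(siteUrl)               // the site URL, taken literally
                + "(?:[/?#][^\"']*)?\\1"               // optional path, query or fragment, then the same quote
                + "[^>]*>.*?</a>",                     // remaining attributes, link text, closing tag
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        return anchor.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p> <a title='x' href='http://my-domain.com/page'>My site</a>"
                + " and <a href=\"http://google.com\">Google</a>";
        System.out.println(removeLinksTo(html, "http://my-domain.com"));
    }
}
```

Requiring `/`, `?` or `#` right after the site URL keeps a host such as `http://my-domain.com.evil.org` from matching. Even so, nested markup, unquoted attribute values and anchors embedded in scripts remain unhandled, which is the trade-off the comments point out.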
