1

I'm trying to make something pretty simple, but I simply suck at regular expressions.

My goal is to replace:

<a href="http://www.google.com">Link To Google</a>

with:

<b>Link To Google</b>

in Java.

I tried this:

String input = "<a href=\"http://www.google.com\">Link to Google</a>";
String Regex1 = "<a href(.*)>";
String Regex2 = "</a>";
String output = input.replace(Regex1, "<b>");
output = input.replace(Regex2, "</b>");

But Regex1 does not match my input. Any clue?

Thanks in advance!

Thordax
  • I would expect `Regex1` to match the whole `input`, because it's greedy. You need to make it lazy or exclude the `'>'`. – Lev Levitsky Mar 26 '12 at 09:01
  • You really should not use regular expressions with HTML. *Especially* when you still have to ask questions about regular expressions. Work with an HTML parser like [jsoup](http://jsoup.org/) instead. – Tomalak Mar 26 '12 at 09:01

4 Answers

2

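As a regex, the pattern does match; it just matches too much, because `.*` is greedy. And you should not use regex to parse HTML.

Two things need fixing in the code. `String.replace` treats its first argument as literal text, so use `replaceAll` to apply a regex. And the second replacement has to run on the result of the first, not on the original string:

String output = input.replaceAll(Regex1, "<b>");
output = output.replaceAll(Regex2, "</b>");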
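A runnable sketch of that chaining, using `replaceAll` so the patterns are treated as regexes (`String.replace` would take them as literal text). The class and method names are made up for illustration. With the original greedy pattern the result is still wrong, since `.*` runs to the last `>` in the string:

```java
public class ChainedReplace {
    // Hypothetical helper: applies both patterns, the second one on the result of the first.
    static String convert(String input, String regex1, String regex2) {
        String output = input.replaceAll(regex1, "<b>");
        return output.replaceAll(regex2, "</b>");
    }

    public static void main(String[] args) {
        String input = "<a href=\"http://www.google.com\">Link to Google</a>";

        // Literal replace: the text "<a href(.*)>" does not occur in the input, so nothing changes.
        System.out.println(input.replace("<a href(.*)>", "<b>"));

        // Regex replace with the greedy pattern: the match swallows the whole input.
        System.out.println(convert(input, "<a href(.*)>", "</a>")); // prints "<b>"
    }
}
```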

You can make it work for your example by using:

String Regex1 = "<a href.*?>";

This makes the quantifier lazy (non-greedy), so it stops at the first `>`. But the expression will break on the slightest change in the input HTML, which is one of the reasons why you shouldn't use regex to work with HTML.

Some simple examples the above regex would not work for:

<A HREF="http://www.google.com">
<a  href="http://www.google.com">
<a href="http://www.google.com"
>
<a href=">">
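To see how this plays out, here is a small sketch that runs the lazy pattern over the inputs above. The class name, the `convert` helper, and the `x` link text are assumptions added for illustration. Only the first input is converted correctly; the others either keep their opening tag or are cut at the wrong `>`:

```java
public class LazyPatternDemo {
    static final String LAZY = "<a href.*?>";

    // Hypothetical helper: opening tag via the lazy regex, closing tag via a plain pattern.
    static String convert(String html) {
        return html.replaceAll(LAZY, "<b>").replaceAll("</a>", "</b>");
    }

    public static void main(String[] args) {
        // Works: <b>Link to Google</b>
        System.out.println(convert("<a href=\"http://www.google.com\">Link to Google</a>"));
        // Upper case: the pattern is case-sensitive, so the opening tag is left alone.
        System.out.println(convert("<A HREF=\"http://www.google.com\">x</a>"));
        // Two spaces: the pattern expects exactly one space before href.
        System.out.println(convert("<a  href=\"http://www.google.com\">x</a>"));
        // Line break inside the tag: '.' does not match a newline by default.
        System.out.println(convert("<a href=\"http://www.google.com\"\n>x</a>"));
        // '>' inside the attribute value: the match stops too early.
        System.out.println(convert("<a href=\">\">x</a>"));
    }
}
```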
Qtax
1

Use a parser. They are easy to use and always the more correct solution.

jsoup (http://jsoup.org) would handle your task easily like this:

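import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("your.html");
Document doc = Jsoup.parse(input, "UTF-8");

// Select every <a> element that has an href attribute.
Elements links = doc.select("a[href]");

// Replace each link with a <b> element carrying the same text.
for (Element link : links) {
  Element bold = doc.createElement("b").appendText(link.text());
  link.replaceWith(bold);
}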

// now do something with...
// doc.outerHtml()
Tomalak
  • Isn't that a bit of an overkill for such a small task? – mbatchkarov Mar 26 '12 at 09:23
  • @reseter, is there any other way of doing it correctly? You could probably use a SAX parser in this case instead, to make it more efficient. – Qtax Mar 26 '12 at 09:27
  • @reseter What do you know of the size of the task? Maybe the OP already does all kinds of crazy regex things to the document and this solution would actually streamline the process. Anyway, I prefer *"slow & correct"* (not to forget, maintainable by folks that don't know regex very well) over *"fast & works most of the time, unless the HTML is broken, changes structure or someone meddles with the regex"*. – Tomalak Mar 26 '12 at 09:29
  • @Qtax Sure, but SAX would be more work also. There are other parsers if this particular one isn't the right fit. And "more efficient" is something I'd take care of when I hit a bottleneck caused by this bit of code. – Tomalak Mar 26 '12 at 09:35
0

If you want it to work, replace Regex1 with:

<a href=\"(.*)\">

And then run the second replacement on the result of the first:

output = output.replace(Regex2, "</b>");
Bogdan Emil Mariesan
0

I don't know about using regexes in Java, but there must be a "capture group" notion:

Your initial regex would be: "<a\s+href\s*=\s*".*?">(.*?)</a>"

That you would replace with: "<b>$1</b>" (where $1 means the group captured between the parentheses in the first regex)
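Java does have this notion: `String.replaceAll` accepts `$1` in the replacement string to refer to the first capture group. A minimal sketch of the idea, with the backslashes and quotes escaped for a Java string literal (the class and method names are assumptions for illustration, and the same caveats about fragile HTML matching apply):

```java
public class CaptureGroupDemo {
    // The pattern from above, escaped for a Java string literal.
    static final String LINK = "<a\\s+href\\s*=\\s*\".*?\">(.*?)</a>";

    // Hypothetical helper: replaces each matched link with its text wrapped in <b>.
    static String boldLinks(String html) {
        return html.replaceAll(LINK, "<b>$1</b>");
    }

    public static void main(String[] args) {
        String input = "<a href=\"http://www.google.com\">Link to Google</a>";
        System.out.println(boldLinks(input)); // prints "<b>Link to Google</b>"
    }
}
```

Because both opening and closing tags are handled in one pass, no second replacement is needed.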

David Brabant