1

How can I match the href and the 'a' value in a link?

I want to extract 'www.google.com' and 'test' from the link below:

<A HREF="www.google.com/test.html" title="test">test</A>

Here is what I am trying: '<A HREF=(.+).html', but it is not matching.

Andy Lester
blue-sky
  • 2
    Do NOT use regular expressions for parsing HTML. There are plenty of HTML parsers out there for various languages. Which one are you using? – pemistahl Jan 16 '13 at 21:12
  • 1
    To the user's defense, sometimes all you want is a quick dirty regex because you're processing something one-off and you know the tags are always structured in a particular way... But the given regex is not a very good start for the problem at hand. – paddy Jan 16 '13 at 21:16
  • 2
    Things never end up as easy as they start off, but a regex for _this exact case_ would be something like [`\(.*\)`](http://refiddle.com/gjv). Use at own peril :) – Joachim Isaksson Jan 16 '13 at 21:37
  • @Peter Stahl I'm using it for Scala. – blue-sky Jan 16 '13 at 21:50
  • @Joachim Isaksson put your last comment in an answer and I'll accept it. – blue-sky Jan 16 '13 at 22:39
  • @PeterStahl, more often than not you would be right. However, I've used regex successfully many times for quick and dirty jobs. This is usually much faster than wiring up an HTML parser, and sometimes it's all that is required. – Andrew Savinykh Jan 17 '13 at 04:46

3 Answers

1

Try this:

<A.*HREF\s*=\s*(?:"|')([^"']*)(?:"|').*>(.*)<\/A>

Group 1 captures the full href value (www.google.com/test.html) and group 2 captures the link text (test).
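
Since the asker mentions Scala in the comments, here is a minimal sketch of how the pattern above might be applied with scala.util.matching.Regex. The names LinkExtractor, extract and hostAndText are hypothetical, and the (?i) flag is an added assumption so that lowercase tags also match. Because group 1 holds the whole href, the sketch splits off everything before the first '/' to get the host, which only makes sense for scheme-less URLs like the one in the question.

```scala
import scala.util.matching.Regex

object LinkExtractor {
  // Pattern from the answer above; (?i) is an added assumption for case-insensitive tags.
  val link: Regex = """(?i)<A.*HREF\s*=\s*(?:"|')([^"']*)(?:"|').*>(.*)<\/A>""".r

  // Returns (full href value, link text) for the first link found, if any.
  def extract(html: String): Option[(String, String)] =
    link.findFirstMatchIn(html).map(m => (m.group(1), m.group(2)))

  // Keeps only the part of the href before the first '/'.
  // Assumes the href has no scheme such as "http://".
  def hostAndText(html: String): Option[(String, String)] =
    extract(html).map { case (href, text) => (href.takeWhile(_ != '/'), text) }
}
```

The greedy .* pieces mean this is only reasonable for a single link per input string; with several links on one line the groups would span more than one tag.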

prageeth
  • Note that it will ONLY work on this one specific tag, which clearly isn't even a real example because the URL is incorrect. – Andy Lester Jan 17 '13 at 04:31
1

Regular expressions for HTML can be brittle when the markup changes, but a regex for this exact case would be:

<A HREF="\(.*\)" .*>\(.*\)</A>
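
As written, the escaped parentheses resemble the group syntax of tools such as sed or grep; in java.util.regex, which Scala uses, \( matches a literal parenthesis. A rough Scala adaptation, assuming unescaped groups and non-greedy matching, might look like the sketch below. The object name ExactCaseLink and the helper all are hypothetical, and the pattern is deliberately loosened so that attributes after HREF are optional.

```scala
import scala.util.matching.Regex

object ExactCaseLink {
  // Unescaped parentheses are capture groups in java.util.regex.
  // [^"]* stops at the closing quote, [^>]* stays inside the opening tag,
  // and .*? stops at the first </A>, so several links on one line stay separate.
  val link: Regex = """<A HREF="([^"]*)"[^>]*>(.*?)</A>""".r

  // Returns (href value, link text) for every match, in order of appearance.
  def all(html: String): List[(String, String)] =
    link.findAllMatchIn(html).map(m => (m.group(1), m.group(2))).toList
}
```

This is still tied to the exact tag shape in the question (uppercase A and HREF, double quotes, HREF as the first attribute), so it remains a one-off tool rather than a general HTML link extractor.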

trincot
Joachim Isaksson
  • The link to refiddle.com [should be replaced](https://meta.stackoverflow.com/q/424469/5459839) (I removed the link), but this regex cannot be right -- surely the parentheses should not be escaped (is this a relic from how SO formatting previously needed escapes here?), and the `.*` should not be greedy, and in some environments the `/` should be escaped. – trincot May 02 '23 at 14:55
0

Because the text "html" does not appear in your tag.

paddy