
I have the following code, which extracts the href URLs of the <a> tags from an XML string and works correctly:

Pattern p = Pattern.compile("<a[^>]+href\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");
Matcher m = p.matcher(xmlString);
while (m.find())
    imagesURLs.add(m.group(1));

I have the following:

<a href="http://...">some text</a>

The code above gives me <a href="http://..."> in m.group(0) and http://... in m.group(1).

I also want to get the full <a href="http://...">some text</a>.

How can I achieve this by modifying the regex?

Ali
hasan
  • 2
    Use a HTML parser instead – PeeHaa May 17 '14 at 14:19
  • Can you explain why I want to do that? is it faster? – hasan May 17 '14 at 14:20
  • 1
    Because regex is brittle and unmaintainable (for this purpose) – PeeHaa May 17 '14 at 14:21
  • 1
    Add another capturing group `(.*?)` to the end and use with DOTALL [modifier](http://www.regular-expressions.info/modifiers.html) , for making the `.` also match newlines (see [example](http://regex101.com/r/qM0vB8)). Using a [lazy](http://www.regular-expressions.info/repeat.html) quantifier. – Jonny 5 May 17 '14 at 14:39
  • add it as an answer so I can appropriate it :) – hasan May 17 '14 at 15:32
  • 2
    [You can't parse HTML with Regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Jakob Weisblat May 17 '14 at 16:04
  • [Oh yes you can parse HTML with Patterns](http://stackoverflow.com/a/4234491/471272). – tchrist Jun 06 '14 at 22:35

1 Answer


With all the disclaimers about using regex to parse HTML, you can use this:

(?is)(<a[^>]+href\s*=\s*(['"])([^'"]+)\2[^>]*>).*?</a>
  1. Group 0 is the entire match: <a href="http://...">some text</a>
  2. Group 1 is the opening tag: <a href="http://...">
  3. Group 2 is something I added to ensure that your opening quote is the same kind as your closing quote. Ignore it.
  4. Group 3 is the url: http://...

See the groups in this demo

To use it in Java, as you know, you need to escape the backslashes and double quotes inside the string literal. Something like:

Pattern p = Pattern.compile("(?is)(<a[^>]+href\\s*=\\s*(['\"])([^'\"]+)\\2[^>]*>).*?</a>");
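
For completeness, here is a minimal sketch of how the groups could be read back in a loop like the one in the question. The class name AnchorExtractor, the extract helper, and the sample input are hypothetical and only there for illustration; the pattern itself is the one above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
public class AnchorExtractor {
    // (?is) turns on case-insensitive matching and DOTALL,
    // so ".*?" can also span line breaks inside the link text.
    private static final Pattern ANCHOR = Pattern.compile(
        "(?is)(<a[^>]+href\\s*=\\s*(['\"])([^'\"]+)\\2[^>]*>).*?</a>");

    // Returns one {full element, opening tag, url} triple per match.
    public static List<String[]> extract(String xml) {
        List<String[]> results = new ArrayList<>();
        Matcher m = ANCHOR.matcher(xml);
        while (m.find()) {
            // group(0) = whole <a ...>...</a>, group(1) = opening tag,
            // group(2) = the quote character, group(3) = the URL.
            results.add(new String[] { m.group(0), m.group(1), m.group(3) });
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample input is an assumption made for this sketch.
        String xmlString = "<p><a href=\"http://example.com/a\">some text</a></p>";
        for (String[] hit : extract(xmlString)) {
            System.out.println("full element: " + hit[0]);
            System.out.println("opening tag:  " + hit[1]);
            System.out.println("url:          " + hit[2]);
        }
    }
}
```

Note that with this pattern the URL moves from m.group(1) to m.group(3), so the imagesURLs.add(...) call in the question would need to change accordingly.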
zx81