1

I have the following code:

private String anchorRegex = "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>";
private Pattern anchorPattern = Pattern.compile(anchorRegex, Pattern.CASE_INSENSITIVE);
String content = getContentAsString();
Matcher matcher = anchorPattern.matcher(content);

while(matcher.find()) {
    System.out.println(matcher.group(1));
}

The call to getContentAsString() returns the HTML content from a web page. The problem I'm having is that the only thing that gets printed in my System.out is a space. Can anyone see what's wrong with my regex?

Regex drives me crazy sometimes.

Justin Kredible

3 Answers

1

You need to delimit your capturing group from the .*? that follows it. There are probably double quotes " around the href value, so use those:

<\s*a\s+.*?href\s*=\s*"(\S*?)".*?>

Your regex contains:

([^\s]*?).*?

The ([^\s]*?) says to reluctantly match non-whitespace characters and save them in a group. But a reluctant *? matches only as much as the rest of the pattern requires, and the next part is .*?, which accepts any character. So the group stops at the first possible chance, usually after matching nothing at all, and it is the .*? which matches the rest of the URL.
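To see the difference, here is a small sketch that runs both patterns against one anchor tag. The class name, the helper firstGroup, and the sample HTML are made up for illustration; only the two regexes come from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantGroupDemo {
    // Pattern from the question: the reluctant group is followed directly by .*?
    static final Pattern ORIGINAL = Pattern.compile(
            "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>", Pattern.CASE_INSENSITIVE);

    // Pattern from this answer: literal double quotes delimit the group on both sides.
    static final Pattern DELIMITED = Pattern.compile(
            "<\\s*a\\s+.*?href\\s*=\\s*\"(\\S*?)\".*?>", Pattern.CASE_INSENSITIVE);

    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the question.
        String html = "<a class=\"x\" href=\"http://example.com/page\">link</a>";
        System.out.println("[" + firstGroup(ORIGINAL, html) + "]");   // prints []
        System.out.println("[" + firstGroup(DELIMITED, html) + "]");  // prints [http://example.com/page]
    }
}
```

The first pattern still matches the tag, but its group is empty because nothing forces the reluctant quantifier to consume any characters.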

beerbajay
1

The regex you should be using is this:

String anchorRegex = "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]";
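As a rough sketch of how this regex could be plugged into the loop from the question: the class name, the extractHrefs helper, and the sample HTML below are hypothetical, and CASE_INSENSITIVE is carried over from the question rather than required by the regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // (?s) lets . match line breaks, so attributes may be spread over several lines.
    private static final Pattern ANCHOR = Pattern.compile(
            "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]", Pattern.CASE_INSENSITIVE);

    // Collects group 1 (the href value) of every match in the given HTML.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical sample input with single and double quotes and a line break.
        String html = "<p><a href=\"http://example.com/one\">1</a>\n"
                + "<A\n  class='c' HREF='/two'>2</A></p>";
        System.out.println(extractHrefs(html)); // prints [http://example.com/one, /two]
    }
}
```

One caveat: because .*? can cross tag boundaries in dot-all mode, an anchor without an href (for example a named anchor) may cause the match to run on into the href of a later tag.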
anubhava
0

This should be able to pull out the href without too much trouble.
The link is in capture group 2. The regex is shown in expanded (free-spacing) form and assumes dot-all mode.
Escape it as a Java string literal as necessary.

(?s)
<a 
  (?=\s) 
  (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) href \s*=\s* (['"]) (.*?) \1 
  (?:".*?"|'.*?'|[^>]*?)+ 
>

Or, not expanded and not dot-all:

<a(?=\s)(?:[^>"']|"[^"]*"|'[^']*')*?(?<=\s)href\s*=\s*(['"])([\s\S]*?)\1(?:"[\s\S]*?"|'[\s\S]*?'|[^>]*?)+>
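For reference, this is roughly what the single-line version looks like once escaped as a Java string literal (backslashes doubled, double quotes escaped). The class name, the firstHref helper, and the sample tags are assumptions for illustration, and CASE_INSENSITIVE is borrowed from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedAttrHref {
    // The non-expanded regex from above, split into pieces only for readability.
    static final Pattern ANCHOR = Pattern.compile(
            "<a(?=\\s)"
            + "(?:[^>\"']|\"[^\"]*\"|'[^']*')*?"       // skip other attributes, quoted values as a unit
            + "(?<=\\s)href\\s*=\\s*"
            + "(['\"])([\\s\\S]*?)\\1"                  // group 1 = quote, group 2 = link
            + "(?:\"[\\s\\S]*?\"|'[\\s\\S]*?'|[^>]*?)+>",
            Pattern.CASE_INSENSITIVE);

    // Returns the href of the first anchor found, or null if there is none.
    static String firstHref(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical inputs; the second has href-like text inside another attribute's value.
        System.out.println(firstHref("<a class=\"x\" href='/p?q=1' id=z>text</a>"));
        System.out.println(firstHref("<a title=\"href='no'\" href=\"/yes\">text</a>"));
    }
}
```

Because quoted attribute values are consumed as a unit before href is looked for, text that merely looks like an href inside another attribute should not be picked up.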