-1

I need to extract all links to images from an HTML document. So far I can extract the href attribute value using this regex:

private static final String HTML_A_HREF_TAG_PATTERN = 
    "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

I only need the links to images, so I have to check whether the string ends with an image extension (jpg, jpeg, png, gif).

How can I modify my pattern?

cesare

3 Answers

3

Please refrain from using regular expressions to extract data from HTML. You'll find plenty of reasons why on this site.

In your case, you could use JSoup to parse the HTML source of the page and select the elements you need, as in the example below (adapted from here):

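    // Requires jsoup (org.jsoup.Jsoup, org.jsoup.nodes.Document/Element,
    // org.jsoup.select.Elements) and java.util.regex.Pattern.
    Document doc = Jsoup.connect(url).get();
    // Every element that carries a src attribute (img, script, iframe, ...).
    Elements media = doc.select("[src]");

    // Match a dot plus an image extension at the end of the URL, ignoring case.
    Pattern imageExt = Pattern.compile("\\.(?:jpe?g|png|gif)$", Pattern.CASE_INSENSITIVE);
    for (Element src : media) {
        // "abs:src" resolves the attribute against the document's base URI.
        String absoluteUrl = src.attr("abs:src");
        if (imageExt.matcher(absoluteUrl).find()) {
            System.out.println(absoluteUrl);
        }
    }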
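
A plain suffix test on the URL can still miss links that carry a query string or fragment, for example photo.jpg?size=large. As a rough sketch only, the check could be moved into a small helper that isolates the path with java.net.URI first. The class name `ImageUrlCheck`, the method `isImageUrl` and the sample URLs are hypothetical and not part of the original answer.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not from the original answer.
final class ImageUrlCheck {
    private static final String[] IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"};

    // Returns true when the path part of the URL ends with an image extension.
    static boolean isImageUrl(String url) {
        String path;
        try {
            // getPath() leaves out the query string and the fragment.
            path = URI.create(url.trim()).getPath();
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        String lower = path.toLowerCase(Locale.ROOT);
        for (String ext : IMAGE_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isImageUrl("http://example.com/a/photo.JPG?size=large"));
        System.out.println(isImageUrl("http://example.com/page.html"));
    }
}
```

The same helper could be applied to `abs:href` values of `a[href]` elements if the links, rather than the embedded media, are what you are after.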
npinti
1

You could use a regex similar to this:

<[^>]+href\s*=\s*['"]([^'"]+\.(?:jpg|png|jpeg|gif))['"][^>]*>

Explanation

<[^>]+: the opening < of a tag, followed by any characters other than the closing >,

href\s*=\s*: the text href followed by an = sign, with optional whitespace around it,

['"]: the opening quote, ' or ",

([^'"]+\.(?:jpg|png|jpeg|gif)): a string of any characters except ' or " that ends in a literal dot and an image extension (captured in group 1),

['"]: the closing quote, ' or ",

[^>]*>: anything that remains up to the closing >.

This is quite similar to your regex. I'm not sure how well it works in Java, but I did try it in an online Java regex tester.
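
To try this pattern from Java, the backslashes and double quotes have to be escaped inside the string literal. The following is a minimal sketch, not a tested production solution: the class name `HrefImageRegex`, the method `imageLinks` and the sample HTML are made up for illustration. It writes the dot before the extension as `\\.` so that only a literal dot matches, and it adds `Pattern.CASE_INSENSITIVE`, which is my assumption about what is wanted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern, for illustration only.
final class HrefImageRegex {
    // Group 1 captures the href value that ends in an image extension.
    private static final Pattern IMAGE_HREF = Pattern.compile(
        "<[^>]+href\\s*=\\s*['\"]([^'\"]+\\.(?:jpg|png|jpeg|gif))['\"][^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Collects every captured href value found in the given HTML text.
    static List<String> imageLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = IMAGE_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"images/cat.png\">cat</a>"
                    + "<a href='doc.pdf'>doc</a>"
                    + "<a class=\"x\" href='dog.JPG'>dog</a>";
        System.out.println(imageLinks(html));
    }
}
```

Note that the pattern accepts mismatched quotes (an opening ' closed by ") and does not handle unquoted attribute values, so it is looser than a real HTML parser.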

fronthem
  • Even for a regex the use of `[^>]` alone is a poor way to do it. Matches `` but not `` Might as well not even try to do tag parsing. –  Aug 24 '15 at 19:41
  • `<(?!\!--)[^>]+href\s*=\s*['"]([^'"]+\.(?:jpg|png|jpeg|gif))['"][^>]*>` might help. – fronthem Aug 24 '15 at 20:01
1

Disclaimer: parsing HTML with regex is not recommended!

Although imperfect, this might work. The link is in capture group 2.

 # "(?si)<[\\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*(?:(['\"])\\s*((?:(?!\\1).)*?\\.(?:jpg|png|jpeg|gif))\\s*\\1))\\s+(?:\".*?\"|'.*?'|[^>]*?)+>"

 (?si)
 < [\w:]+ 
 (?=
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      (?<= \s )
      href \s* = \s* 
      (?:
           ( ['"] )                      # (1)
           \s* 
           (                             # (2 start)
                (?:
                     (?! \1 )
                     . 
                )*?
                \.
                (?: jpg | png | jpeg | gif )
           )                             # (2 end)
           \s* 
           \1 
      )
 )
 \s+ 
 (?: " .*? " | ' .*? ' | [^>]*? )+
 >
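
As a usage sketch, the Java string form of the pattern can be compiled once and the link read from capture group 2 on every match. The class name `TagHrefImageRegex`, the method `imageLinks` and the sample HTML below are hypothetical; the pattern string itself is copied from the literal shown at the top of this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern, for illustration only.
final class TagHrefImageRegex {
    // (?si) is part of the pattern: dot matches newlines, case-insensitive.
    private static final Pattern IMAGE_TAG = Pattern.compile(
        "(?si)<[\\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*"
        + "(?:(['\"])\\s*((?:(?!\\1).)*?\\.(?:jpg|png|jpeg|gif))\\s*\\1))"
        + "\\s+(?:\".*?\"|'.*?'|[^>]*?)+>");

    // Returns the content of capture group 2 (the link) for every matching tag.
    static List<String> imageLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = IMAGE_TAG.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"images/cat.png\">cat</a>"
                    + "<a href='doc.pdf'>doc</a>"
                    + "<a class=\"x\" href='dog.JPG'>dog</a>";
        System.out.println(imageLinks(html));
    }
}
```

Group 1 only holds the quote character, which the backreference `\1` uses to require a matching closing quote.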