
I have a regex to get the src and the remaining attributes for all the images present in the content.

<img *((.|\s)*?) *src *= *['"]([^'"]*)['"] *((.|\s)*?) */*>

If the content I am matching against is like

<img src=src1"/> <img src=src2"/>

the call to find(index) hangs, and the thread dump shows the following:

at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345) 

Is there a solution or a workaround for solving this issue?

VLAZ
  • You may wish to read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 before continuing. – Dawood ibn Kareem Oct 01 '13 at 21:26
  • Duplicate of http://stackoverflow.com/a/2408599/20938. Never, ever use `(.|\s)` in a regex. Just specify DOTALL mode and use `.` by itself. – Alan Moore Oct 01 '13 at 21:56
  • The attribute values in your example are missing the opening quotes. I hope that's just a typo you introduced in the question. – Alan Moore Oct 01 '13 at 22:03
  • Regex is not suited to parsing HTML; use an HTML parser. – Firas Dib Oct 01 '13 at 22:04
  • @Alan ... yes the example is missing quotes and that's why the regex should not find the src and other attributes. But it hangs up. – user2836528 Oct 01 '13 at 22:10

2 Answers


A workaround is to use an HTML parser such as jsoup, for example:

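import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc =
      Jsoup.parse("<html><img src=\"src1\"/> <img src=\"src2\"/></html>");
// Select only the <img> elements that actually carry a src attribute.
Elements elements = doc.select("img[src]");
for (Element element : elements) {
    // attr() returns an empty string when the attribute is absent.
    System.out.println(element.attr("src"));
    System.out.println(element.attr("alt"));
    System.out.println(element.attr("height"));
    System.out.println(element.attr("width"));
}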
Reimeus

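It looks like what you've got is an "evil regex", which is not uncommon when you try to construct a complicated regex to match one thing (src) within another thing (img). In particular, evil regexes usually happen when you apply repetition to a complex subexpression, which you are doing with (.|\s)*?. Most whitespace characters match both . and \s, so the engine can match the same text in many different ways, and on input that cannot match it may end up trying a huge number of them before giving up.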
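To make that concrete, here is a minimal sketch (not a full HTML parser) of the pattern from the question with each (.|\s)*? replaced by [^>]*?. The class name ImgTagSketch and the helper parseFirst are made up for illustration. The character class gives the engine only one way to consume each character and cannot run past the end of a tag, so the malformed input from the question should simply fail to match instead of hanging. One known limitation of this sketch is that \bsrc would also accept an attribute such as data-src.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagSketch {
    // [^>]*? replaces (.|\s)*?: one way to match each character,
    // and the match can never extend past the closing '>' of the tag.
    static final Pattern IMG = Pattern.compile(
            "<img\\b([^>]*?)\\bsrc\\s*=\\s*['\"]([^'\"]*)['\"]([^>]*?)/?>",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns {attributes before src, src value,
    // attributes after src} for the first matching tag, or null if none matches.
    static String[] parseFirst(String html) {
        Matcher m = IMG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1).trim(), m.group(2), m.group(3).trim() };
    }

    public static void main(String[] args) {
        String[] parts = parseFirst("<img alt=\"a\" src=\"x.png\" width=\"5\"/>");
        System.out.println(String.join(" | ", parts));
        // The malformed input from the question: no match, and no hang.
        System.out.println(parseFirst("<img src=src1\"/> <img src=src2\"/>") == null);
    }
}
```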

A better approach would be to use two regexes: one to match all <img> tags, and then another to match the src attribute within each tag.

My Java's rusty, so I'll just give you the pseudocode solution:

foreach( imgTag in input.match( /<img .*?>/ig ) ) {
    src = imgTag.match( /\bsrc *= *(['\"])(.*?)\1/i );
    // if you want to get other attributes, you can do that the same way:
    alt = imgTag.match( /\balt *= *(['\"])(.*?)\1/i );
    // even better, you can get all the attributes in one go:
    attrs = imgTag.match( /\b(\w+) *= *(['\"])(.*?)\2/g );
    // attrs is now an array where the first group is the attr name
    // (alt, height, width, src, etc.) and the second group is the
    // attr value
}

Note the use of a backreference to match the appropriate type of closing quote (i.e., this will match both src='abc' and src="abc"). Also note that the quantifiers are lazy here (*? instead of just *); this is necessary to keep a match from running past the closing quote or the end of the tag.
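As a small illustration of the backreference, the sketch below applies the src pattern from the pseudocode to a few tags. The class name QuoteBackrefSketch and the helper srcOf are invented for this example. It assumes the tag fits on one line, since . does not match line terminators without DOTALL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteBackrefSketch {
    // Group 1 captures the opening quote; \1 requires the same character to close the value.
    static final Pattern SRC =
            Pattern.compile("\\bsrc *= *(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the src value, or null when no properly quoted src exists.
    static String srcOf(String tag) {
        Matcher m = SRC.matcher(tag);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(srcOf("<img src='abc'>"));          // single quotes
        System.out.println(srcOf("<img src=\"abc\">"));        // double quotes
        System.out.println(srcOf("<img src=\"it's.png\">"));   // other quote type inside the value
        System.out.println(srcOf("<img src='abc\">"));         // mismatched quotes
        System.out.println(srcOf("<img src=src1\"/>"));        // missing opening quote
    }
}
```

Because the closing quote must equal the opening one, a value may contain the other kind of quote, and a tag with mismatched or missing quotes is rejected rather than half-matched.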

EDIT: even though my Java's rusty, I was able to crank out an example. Here's the solution in Java:

import java.util.regex.*;

public class Regex {

    public static void main( String[] args ) {
        String input = "<img alt=\"altText\" src=\"src\" height=\"50\" width=\"50\"/> <img alt='another image' src=\"foo.jpg\" />";
        Pattern attrPat = Pattern.compile( "\\b(\\w+) *= *(['\"])(.*?)\\2" );
        Matcher imgMatcher = Pattern.compile( "<img .*?>" ).matcher( input );
        while( imgMatcher.find() ) {
            String imgTag = imgMatcher.group();
            System.out.println( imgTag );
            Matcher attrMatcher = attrPat.matcher( imgTag );
            while( attrMatcher.find() ) {
                // Group 1 is the attribute name; group 3 is the value between the quotes.
                String attr = attrMatcher.group(1);
                String value = attrMatcher.group(3);
                System.out.format( "\tattr: %s, value: %s%n", attr, value );
            }
        }
    }
}
Ethan Brown
  • I don't really see how this is an "evil regex".. Care to explain? You can look at some debugging output here: http://regex101.com/r/wH4rD7/#debugger – Firas Dib Oct 01 '13 at 21:52
  • I am not looking for just the src. I need the other attributes too (before and after the src), for example altText. – user2836528 Oct 01 '13 at 21:54
  • If you look carefully, Lindrian, I linked "evil regex" above. That'll explain all about evil regexes. As for your second comment, you can also pull whatever else you need inside the body. Not only will my approach work, it'll work even better. I'll update my answer to indicate how you can get all of the attributes. – Ethan Brown Oct 01 '13 at 22:58
  • Edited my answer with a Java solution instead of just pseudocode. – Ethan Brown Oct 01 '13 at 23:30