I have a weird problem. This Java method works fine in my program:

/*
* Extract all image urls from the html source code
*/
public void extractImageUrlFromSource(ArrayList<String> imgUrls, String html) {
    Pattern pattern = Pattern.compile("\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>");
    Matcher matcher = pattern.matcher(html);
    while (matcher.find()) {
        imgUrls.add(extractImgUrlFromTag(matcher.group()));
    }
}

This method works fine in my Java application, but whenever I run it in a JUnit test, it only adds the last URL to the ArrayList:

/**
 * Test of extractImageUrlFromSource method, of class ImageDownloaderProc.
 */
@Test
public void testExtractImageUrlFromSource() {
    System.out.println("extractImageUrlFromSource");
    String html = "<html><title>fdjfakdsd</title><body><img kfjd src=\"http://image1.png\">df<img dsd src=\"http://image2.jpg\"></body><img dsd src=\"http://image3.jpg\"></html>";
    ArrayList<String> imgUrls = new ArrayList<String>();
    ArrayList<String> expimgUrls = new ArrayList<String>();
    expimgUrls.add("http://image1.png");
    expimgUrls.add("http://image2.jpg");
    expimgUrls.add("http://image3.jpg");
    ImageDownloaderProc instance = new ImageDownloaderProc();
    instance.extractImageUrlFromSource(imgUrls, html);
    imgUrls.stream().forEach((x) -> {
        System.out.println(x);
    });
    assertArrayEquals(expimgUrls.toArray(), imgUrls.toArray());
}

Is JUnit at fault? Remember, it works fine in my application.

2 Answers

0

I wish I could comment as I'm not sure about this, but it might be worth mentioning...

This line looks like it's extracting the URLs from the wrong array. Did you mean to extract from expimgUrls instead of imgUrls?

instance.extractImageUrlFromSource(imgUrls, html);

I haven't gotten this far in my Java education so I may be incorrect...I just looked over the code and noticed it. I hope someone else who knows more can actually give you a solid answer!

0

I think there is a problem in the regex:

  "\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>"

The problem (or at least one problem) is the first .*. The + and * quantifiers are greedy, which means that they will attempt to match as many characters as possible. In your unit test, I think that what is happening is that the first .* is matching everything up to the last 'src' in the input string, so the whole input yields a single match.

I suspect that the reason that this "works" in your application is that the input data is different. Specifically, I suspect that you are running your application on input files where each img element is on a different line. Why does this make a difference? Well, it turns out that by default, the . metacharacter does not match line breaks.
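The greedy behaviour described above can be checked with a small sketch. The pattern is copied from the question; the class name GreedyMatchDemo, the helper countMatches and the two sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyMatchDemo {
    // The pattern from the question, unchanged.
    static final Pattern IMG = Pattern.compile(
        "\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>");

    // Hypothetical helper: counts how many separate matches find() reports.
    static int countMatches(String html) {
        Matcher m = IMG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two img tags on one line: the greedy .* spans both tags.
        String oneLine = "<img a src=\"http://image1.png\">df<img b src=\"http://image2.jpg\">";
        // Same tags separated by a line break, which . does not cross by default.
        String twoLines = "<img a src=\"http://image1.png\">\n<img b src=\"http://image2.jpg\">";
        System.out.println(countMatches(oneLine));  // prints 1
        System.out.println(countMatches(twoLines)); // prints 2
    }
}
```

If your application's input has one img tag per line, this would explain why the same method appears to work there.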


For what it is worth, using regexes to "parse" HTML is generally thought to be a bad idea. For a start, it is horribly fragile. People who do a lot of this kind of stuff tend to use proper HTML parsers ... like "jsoup".

Reference: RegEx match open tags except XHTML self-contained tags

Community
Stephen C
  • The reason for [\t\n\r\f ]+ is that there must be at least one whitespace character after img. The reason for the .* after that is that the width, height and alt attributes of an image tag sometimes come before src. – rexwynnohay Aug 10 '14 at 04:36
  • You need to replace the ".*" with something that won't consume too much. The same goes for the following ones. Hint: they should not consume a `>`. – Stephen C Aug 10 '14 at 04:37
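One way to follow the hint in the comment above is to replace each .* with a negated character class, so that nothing can run past the closing > of the tag or the closing quote of the attribute. This is only a sketch: the class name ImgSrcExtractor and the method extract are hypothetical, it handles double-quoted src values only, and it would also accept attributes such as data-src. It uses a capture group for the URL, so it does not need the extractImgUrlFromTag helper from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {
    // [^>]* cannot leave the current tag; [^"]* cannot leave the quoted value.
    static final Pattern IMG_SRC = Pattern.compile(
        "<\\s*img\\s+[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the src value of every img tag found.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the text between the quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        // The input string from the unit test in the question.
        String html = "<html><title>fdjfakdsd</title><body><img kfjd src=\"http://image1.png\">df"
            + "<img dsd src=\"http://image2.jpg\"></body><img dsd src=\"http://image3.jpg\"></html>";
        // prints [http://image1.png, http://image2.jpg, http://image3.jpg]
        System.out.println(extract(html));
    }
}
```

This is still a regex over HTML, so it stays fragile against single-quoted or unquoted attributes; a real HTML parser avoids those cases.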