3

I want to extract the URL of an image from HTML code, e.g. the HTML below:

<div class="imageContainer">
   <img src="http://ecx.images-amazon.com/images/I/41%2B7N48F7JL._SL135_.jpg"
      alt="" width="135" height="94"
      style="margin-top: 21px; margin-bottom:20px;" /></div>

I found this code on the net:

String regexImage = "(?<=<img src=\")[^\"]*";
Pattern pImage = Pattern.compile(regexImage);
Matcher mImage = pImage.matcher(elementString);
while (mImage.find()) {
   String imagePath = mImage.group();
}

It works for the HTML above, using the regular expression

"(?<=<img src=\")[^\"]*"

But now I want to extract the image URL from HTML like the following:

<img onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   src="http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg"
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>
<div class="bp-offer-image image-offer"></div>

Here there are other attributes between img and src=.

I tried the regular expression "(?<=<img (*)src=\")[^\"]*", but it does not work. Could someone give me a regular expression that extracts the image URL, i.e. http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg, from the HTML above?

Also, I first use Jsoup to parse the HTML and extract the img tags:

doc = Jsoup.connect(urlFromBrowse).get();
Elements elements = doc.getElementsByTag("img");

for (Element element : elements) {
    String elementString = element.toString();

and pass this elementString to the matcher() method. From each tag (element) I get, I use regular expressions to extract the image URL, name, and so on.

Alan Moore
  • Don't use Regex. Parse it as html code. – Anirudh Ramanathan Oct 31 '12 at 15:18
  • http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – André Stannek Oct 31 '12 at 15:22
  • Parsing well formed html is easy but if isn't well formed it's a nightmare! – Aubin Oct 31 '12 at 15:23
  • Just saw this on the front page. Surely Java has some DOM parser. Investigate this, rather than regex. – Joel Berger Oct 31 '12 at 15:23
  • @Cthulhu please see question because I have edited it. And now tell me, am I doing wrong by parsing it. –  Oct 31 '12 at 15:27
  • @Aubin :), yes, I am really tired and frustrated now –  Oct 31 '12 at 15:28
  • thanks for your comments, but hey guys, for 2 weeks I'm working on it, now almost done what I wanted, just remaining is that one correct regular expression. And now I can't start my work from scratch to parse it using other techniques. So anyone please give me that regular expression –  Oct 31 '12 at 15:30
  • No, you are already using a DOM parser, so use it. Why do it incorrectly when you almost have it right? – Joel Berger Oct 31 '12 at 15:32

3 Answers

5

This post is an answer to the question, not a guideline.

The question was not "RegExp vs DOM", the question was "Regular expression to extract image url from html code".

Here it is:

String htmlFragment =
   "<img onerror=\"img_onerror(this);\" data-logit=\"true\" data-pid=\"MOBDDDBRHVWQZHYY\"\n" + 
   "   data-imagesize=\"thumb\"\n" + 
   "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" + 
   "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" + 
   "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"\n" + 
   "   title=\"Samsung Galaxy S Duos S7562: Mobile\"></img></a>";
Pattern pattern =
   Pattern.compile( "(?m)(?s)<img\\s+(.*)src\\s*=\\s*\"([^\"]+)\"(.*)" );
Matcher matcher = pattern.matcher( htmlFragment );
if( matcher.matches()) {
   System.err.println(
      "OK:\n" +
      "1: '" + matcher.group(1) + "'\n" +
      "2: '" + matcher.group(2) + "'\n" +
      "3: '" + matcher.group(3) + "'\n" );
}

and the output:

OK:
1: 'onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   '
2: 'http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg'
3: '
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>'
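As a comment below suggests, `find()` may be a better fit than `matches()` when the input contains more than one img tag. The following is only a sketch of that variant, not the answer's own code: the class name `ImgSrcRegex`, the method `extractSrc`, and the exact pattern are assumptions made for illustration. The pattern skips attributes lazily without crossing `>`, requires whitespace before `src` so that attributes such as `data-src` are not picked up, and only handles double-quoted values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcRegex {
    // Hypothetical pattern: "<img", then as few non-'>' characters as possible,
    // then whitespace, "src", "=", and a double-quoted value captured in group 1.
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the src value of every img tag found, in document order.
    static List<String> extractSrc(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {          // find() scans for each match in turn
            urls.add(m.group(1));   // group 1 is the text between the quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
            "<img onerror=\"img_onerror(this);\" data-logit=\"true\"\n" +
            "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" +
            "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" +
            "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"></img>";
        for (String url : extractSrc(html)) {
            System.out.println(url);
        }
    }
}
```

This sketch still breaks on attribute values that contain `>` or on single-quoted or unquoted `src` values, which is part of why the comments recommend a parser.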
Aubin
  • less snarky: how does this handle old, ill-formed HTML better than a DOM parser? How do you know the DOM parser doesn't handle it in the first place? – Joel Berger Oct 31 '12 at 15:50
  • hey, you are great, but I think you have made 1 mistake. Because your code works great here, but not in my code. And reason behind it is(I think that), you have used \ before every ", and designed regular expression for it, but in code there are no \. So please give regular expression for it, you are my last hope –  Oct 31 '12 at 16:37
  • The example code you give contains ". Please, give the url of the real HTML source. – Aubin Oct 31 '12 at 16:49
  • I'm trying a application like http://pinterest.com/. The above html code sample is a tag from html from http://www.amazon.com/ –  Nov 01 '12 at 05:22
  • `if( matcher.matches() )` may be incorrect in the above example; should it be `while( matcher.find() ) `? – Andrew Wyld Feb 10 '13 at 13:33
  • @Aubin How can I change the src for multiple different images in java using jsoup? Can you help me with this question? http://stackoverflow.com/questions/39095067/how-to-change-src-for-multiple-images-in-android – AndroidNewBee Aug 23 '16 at 09:08
2

According to the docs, JSoup (a DOM parser) can easily get an attribute once you have the tag element. Something like

doc.getElementsByTag("img").attr("src")

ought to work.
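Note that calling `attr("src")` on an `Elements` collection returns the value from the first matching element only, so collecting every image URL needs a loop. A minimal sketch of that loop follows, assuming jsoup is on the classpath; the class name `ImgSrcJsoup` and the method `extractSrc` are made up for illustration, and `Jsoup.parse` on a string stands in for the `Jsoup.connect(...).get()` call from the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImgSrcJsoup {
    // Collects the src attribute of every <img> element that has one.
    static List<String> extractSrc(String html) {
        Document doc = Jsoup.parse(html);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) { // CSS selector: img tags with a src attribute
            urls.add(img.attr("src"));               // raw attribute value, no regex needed
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"imageContainer\">" +
            "<img data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\" " +
            "src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\" " +
            "alt=\"Samsung Galaxy S Duos S7562: Mobile\"></div>";
        for (String url : extractSrc(html)) {
            System.out.println(url);
        }
    }
}
```

When the document is fetched with a base URI, as with `Jsoup.connect(url).get()`, `img.absUrl("src")` can be used in place of `img.attr("src")` to resolve relative paths.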

For the record I'm a Perl guy, a community that often reaches for regexes too quickly. I am constantly trying to enlighten people to the joy that is using DOM parsers rather than fragile regexes.

Joel Berger
  • Yes, use DOM for x-html but for ill-formed HTML (3.2) it's not applicable. – Aubin Oct 31 '12 at 15:37
  • who said anything about ill-formed HTML 3.2? – Joel Berger Oct 31 '12 at 15:39
  • this seems like what I wanted, I will try this and will come back. By the way, thanks –  Oct 31 '12 at 15:44
  • you may have to loop over the tag elements, see the docs for the [`Elements`](http://jsoup.org/apidocs/org/jsoup/select/Elements.html) class for helper methods for this. – Joel Berger Oct 31 '12 at 15:45
  • thanks for the useful link, because in my eclipse I'm not getting any documentation for mouse hover for any of the JSoup methods. –  Nov 01 '12 at 05:27
0

I'd expect you to be able to get the various attributes of the <img> element via the JSoup API. Does Node.attributes() give you what you want?

Brian Agnew