The following is an example Amazon link whose image width and height I am trying to retrieve:

http://images.amazon.com/images/P/0099441365.01.SCLZZZZZZZ.jpg

I am using jsoup, and this is my code:

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Crawler_main {

/**
 * @param args
 */
public static void main(String[] args) {
    // TODO Auto-generated method stub
    String filepath = "C:/imagelinks.txt";
    try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
        String line;
        String width;
        //String height;
        while ((line = br.readLine()) != null) {
           // process the line.
            System.out.println(line);
            Document doc = Jsoup.connect(line).ignoreContentType(true).get();
            //System.out.println(doc.toString());
            Elements jpg = doc.getElementsByTag("img");
            width = jpg.attr("width");
            System.out.println(width);
            //String title = doc.title();
        }
    }
    catch (FileNotFoundException ex){
        System.out.println("File not found");
    }
    catch(IOException ex){
        System.out.println("Unable to read line");
    }
    catch (Exception ex){
        System.out.println("Exception occurred");
    }
}

}

The response is fetched, but when I extract the width attribute, it returns null. When I print the fetched document, it contains garbage characters (I am guessing these are the actual image bytes). I can't even paste the document.toString() result into this editor. Help!

Zeeshan Arif

1 Answer

The problem is that you're fetching the JPEG file itself, not an HTML page, so the parsed document contains no img element to read a width attribute from. The call to ignoreContentType(true) provides a clue, as its documentation states:

Ignore the document's Content-Type when parsing the response. By default this is false, an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.)

If you want to obtain the width of the actual jpg file, this SO answer may be of use:

BufferedImage bimg = ImageIO.read(new File(filename));
int width          = bimg.getWidth();
int height         = bimg.getHeight();
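
Since the question starts from URLs rather than local files, the same idea can be applied to a stream opened from the link. The following is only a minimal sketch: the class name `ImageSizeFromUrl` and the helper `readSize` are made up for illustration, and it still decodes the whole image, just as the snippet above does.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.imageio.ImageIO;

public class ImageSizeFromUrl {

    /** Hypothetical helper: decodes the whole image and returns {width, height}. */
    static int[] readSize(InputStream in) throws IOException {
        BufferedImage image = ImageIO.read(in);
        if (image == null) {
            // ImageIO.read returns null when no registered reader understands the data.
            throw new IOException("No registered ImageReader could decode the stream");
        }
        return new int[] { image.getWidth(), image.getHeight() };
    }

    public static void main(String[] args) throws IOException {
        // Example link taken from the question; any direct image URL should work.
        String link = "http://images.amazon.com/images/P/0099441365.01.SCLZZZZZZZ.jpg";
        try (InputStream in = new URL(link).openStream()) {
            int[] size = readSize(in);
            System.out.println(size[0] + " x " + size[1]);
        }
    }
}
```

In the question's loop, each `line` read from the file would take the place of `link`, and the jsoup call would no longer be needed.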
Bas
    +1, but this solution has to read the entire image before it can report its dimensions, which can take a while for big images such as http://upload.wikimedia.org/wikipedia/commons/3/3f/Fronalpstock_big.jpg. To avoid that, use the approach from this answer: http://stackoverflow.com/a/1560052/1393766. As the argument to `ImageIO.createImageInputStream` we can pass `new URL(link).openStream()`. – Pshemo Apr 26 '15 at 09:00
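
The approach described in the comment might look roughly like the sketch below. It is not copied from the linked answer: the class name `ImageHeaderSize` and the helper `readSize` are hypothetical, and how much of the stream is actually consumed depends on the `ImageReader` plugin for the format, though readers generally need only the image header to report dimensions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class ImageHeaderSize {

    /** Hypothetical helper: asks an ImageReader for {width, height} without decoding pixels. */
    static int[] readSize(InputStream source) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(source)) {
            if (in == null) {
                throw new IOException("Could not create an ImageInputStream");
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageReader recognises this format");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // Index 0 is the first (usually the only) image in the stream.
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Example link taken from the question.
        String link = "http://images.amazon.com/images/P/0099441365.01.SCLZZZZZZZ.jpg";
        try (InputStream in = new URL(link).openStream()) {
            int[] size = readSize(in);
            System.out.println(size[0] + " x " + size[1]);
        }
    }
}
```

Because no `BufferedImage` is built, this variant also avoids allocating memory for the full pixel data, which matters most for very large images.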