
Here's my problem. I have a txt file called "sites.txt". In it I type random internet sites. My goal is to save the first image of each site. I tried to filter the server response for the img tag, and it actually works for some sites, but not for others.

For the sites where it works, the img src starts with http:// ... for the sites where it doesn't work, it starts with something else.

I also tried adding http:// to the img src values that didn't have it, but I still get the same error:

    Exception in thread "main" java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(Unknown Source)

My current code is:

    public static void main(String[] args) throws IOException {
        try {
            File file = new File("sites.txt");
            Scanner scanner = new Scanner(file);
            String url;
            int counter = 0;
            while (scanner.hasNext()) {
                url = scanner.nextLine();
                URL page = new URL(url);
                URLConnection yc = page.openConnection();
                BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
                String inputLine = in.readLine();
                while (!inputLine.toLowerCase().contains("img")) inputLine = in.readLine();
                in.close();
                String[] parts = inputLine.split(" ");
                int i = 0;
                while (!parts[i].contains("src")) i++;
                String destinationFile = "image" + (counter++) + ".jpg";
                saveImage(parts[i].substring(5, parts[i].length() - 1), destinationFile);
                String tmp = scanner.nextLine();
                System.out.println(url);
            }
            scanner.close();
        } catch (FileNotFoundException e) {
            System.out.println("File not found!");
            System.exit(0);
        }
    }

    public static void saveImage(String imageUrl, String destinationFile) throws IOException {
        URL url = new URL(imageUrl);
        String fileName = url.getFile();
        String destName = fileName.substring(fileName.lastIndexOf("/"));
        System.out.println(destName);
        InputStream is = url.openStream();
        OutputStream os = new FileOutputStream(destinationFile);

        byte[] b = new byte[2048];
        int length;

        while ((length = is.read(b)) != -1) {
            os.write(b, 0, length);
        }

        is.close();
        os.close();
    }

I also got a tip to use the Apache Jakarta HTTP client libraries, but I have absolutely no idea how I could use those. I would appreciate any help.

  • You can dig into some examples from: http://hc.apache.org/httpcomponents-client-ga/examples.html – px5x2 May 13 '14 at 20:32
  • You'd benefit from using a library like [jsoup](http://jsoup.org/) which makes parsing HTML very easy. Note that you'll not only run into image URLs that are missing a scheme, but you'll also run into **relative** paths, which you'll have to append to the site's URL in order to get. For example, you'll see a path like `/images/srpr/logo9w.png`, which you'd need to append to `"https://www.google.com/"`. – sgbj May 13 '14 at 20:40
  • can you show us some samples of URL placed in your text file. – Braj May 13 '14 at 20:40

3 Answers

3

A URL (a type of URI) requires a scheme in order to be valid. In this case, http.

When you type www.google.com into your browser, the browser is inferring you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.

Make sure you always have http://. You can easily fix this using regex:

String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");

or

if(!stringUrl.startsWith("http://"))
    stringUrl = "http://" + stringUrl;
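Note that the `startsWith("http://")` check would also miss `https://` URLs. If you'd rather not hard-code the prefix, `java.net.URI` reports a missing scheme directly; here's a small sketch (the helper name `ensureHttp` is mine, not from the question):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SchemeCheck {

    // Prepend http:// only when the string has no scheme at all,
    // so https:// and ftp:// URLs pass through untouched.
    public static String ensureHttp(String s) throws URISyntaxException {
        URI uri = new URI(s);
        return (uri.getScheme() == null) ? "http://" + s : s;
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(ensureHttp("www.google.com"));      // http://www.google.com
        System.out.println(ensureHttp("https://google.com/")); // https://google.com/
    }
}
```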
Qix - MONICA WAS MISTREATED
  • Look at the OP post again **I also tried to add the http:// to the img src images which didnt have it, but i still get the same error:** – Braj May 13 '14 at 20:38
  • `The sites where it works the img src starts with http://` and `MalformedURLException: no protocol`. This is clearly his issue. – Qix - MONICA WAS MISTREATED May 13 '14 at 20:42
  • 2
    Hi, I'm on my phone at the moment so I can't check till tomorrow, but thanks for your lightning fast answer. I also tried to add http:// with `if(!parts[i].startsWith("http:")) parts[i] = "http:" + parts[i];` and got the same error – user3634163 May 13 '14 at 20:56
1

An alternative solution

Simply try ImageIO, which contains static convenience methods for locating ImageReaders and ImageWriters and performing simple encoding and decoding.

Sample code:

// read an image from the URL
// I used the URL that is your profile pic on StackOverflow
BufferedImage image = ImageIO
        .read(new URL(
                "https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));

// save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
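One caveat worth knowing: `ImageIO.read` returns `null` when no registered reader understands the bytes, so it pays to check the result before writing. ImageIO also works on arbitrary streams, not just URLs and files; a minimal, network-free round trip (names are mine, just for illustration):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageRoundTrip {
    public static void main(String[] args) throws IOException {
        // a tiny 2x2 RGB image, encoded to JPEG in memory
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "jpg", out);

        // decode it back; read returns null if no reader matches the data
        BufferedImage back = ImageIO.read(new ByteArrayInputStream(out.toByteArray()));
        if (back == null) {
            throw new IOException("no suitable ImageReader found");
        }
        System.out.println(back.getWidth() + "x" + back.getHeight()); // 2x2
    }
}
```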
Braj
0

When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.

Some examples are:

  1. resource = https://google.com/images/srpr/logo9w.png
  2. resource = google.com/images/srpr/logo9w.png
  3. resource = //google.com/images/srpr/logo9w.png
  4. resource = /images/srpr/logo9w.png
  5. resource = images/srpr/logo9w.png

For the second through fifth ones, you'll need to build the rest of the URL.

The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.

The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:

url = site.getProtocol() + ":" + resource

For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.

Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.

import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class SiteScraper {

    public static void main(String[] args) throws IOException {
        URL site = new URL("https://google.com/");
        Document doc = Jsoup.connect(site.toString()).get();
        Elements images = doc.select("img");
        for (Element image : images) {
            String src = image.attr("src");
            System.out.println(buildResourceUrl(site, src));
        }
    }

    static URL buildResourceUrl(URL site, String resource) 
            throws MalformedURLException {
        if (!resource.matches("^(http|https|ftp)://.*$")) {
            if (resource.startsWith("//")) {
                return new URL(site.getProtocol() + ":" + resource);
            } else {
                return new URL(site.getProtocol() + "://" + site.getHost() + "/" 
                        + resource.replaceAll("^/", ""));
            }
        }
        return new URL(resource);
    }
}

This obviously won't cover everything, but it's a start. You may run into problems when the URL you're trying to access is in a subdirectory of the root of the site (e.g., http://some.place/under/the/rainbow.html). You may even encounter base64-encoded data URIs in the src attribute... It really depends on the individual case and how far you're willing to go.
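Incidentally, the subdirectory problem is one the standard library can handle for you: `java.net.URL` has a two-argument constructor, `URL(URL context, String spec)`, that resolves a relative spec against a base URL, including protocol-relative `//host/path` forms. A sketch of how it covers the cases listed above:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveDemo {
    public static void main(String[] args) throws MalformedURLException {
        URL site = new URL("https://google.com/");

        // Absolute path, relative path, protocol-relative, and fully
        // qualified specs all resolve to the same absolute URL here.
        System.out.println(new URL(site, "/images/srpr/logo9w.png"));
        System.out.println(new URL(site, "images/srpr/logo9w.png"));
        System.out.println(new URL(site, "//google.com/images/srpr/logo9w.png"));
        System.out.println(new URL(site, "https://google.com/images/srpr/logo9w.png"));
        // all print: https://google.com/images/srpr/logo9w.png
    }
}
```

It still can't guess a scheme-less `google.com/images/...` (case 2), but it removes the need for the manual host/path concatenation in `buildResourceUrl`.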

sgbj