
Here's my problem. I have a txt file called "sites.txt". In it I type random internet sites. My goal is to save the first image of each site. I tried to filter the server response for the img tag, and it actually works for some sites, but not for others.

For the sites where it works, the img src starts with http:// ... for the sites where it doesn't work, it starts with something else.

I also tried adding http:// to the img src values that didn't have it, but I still get the same error:

    Exception in thread "main" java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(Unknown Source)

My current code is:

    public static void main(String[] args) throws IOException {
        try {
            File file = new File("sites.txt");
            Scanner scanner = new Scanner(file);
            String url;
            int counter = 0;
            while (scanner.hasNext()) {
                url = scanner.nextLine();
                URL page = new URL(url);
                URLConnection yc = page.openConnection();
                BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
                String inputLine = in.readLine();
                while (!inputLine.toLowerCase().contains("img")) inputLine = in.readLine();
                in.close();
                String[] parts = inputLine.split(" ");
                int i = 0;
                while (!parts[i].contains("src")) i++;
                String destinationFile = "image" + (counter++) + ".jpg";
                saveImage(parts[i].substring(5, parts[i].length() - 1), destinationFile);
                String tmp = scanner.nextLine();
                System.out.println(url);
            }
            scanner.close();
        } catch (FileNotFoundException e) {
            System.out.println("File not found!");
            System.exit(0);
        }
    }

    public static void saveImage(String imageUrl, String destinationFile) throws IOException {
        URL url = new URL(imageUrl);
        String fileName = url.getFile();
        String destName = fileName.substring(fileName.lastIndexOf("/"));
        System.out.println(destName);
        InputStream is = url.openStream();
        OutputStream os = new FileOutputStream(destinationFile);

        byte[] b = new byte[2048];
        int length;

        while ((length = is.read(b)) != -1) {
            os.write(b, 0, length);
        }

        is.close();
        os.close();
    }

I also got a tip to use the Apache Jakarta HTTP client libraries, but I have absolutely no idea how I could use those. I would appreciate any help.

  • You can dig into some examples from: http://hc.apache.org/httpcomponents-client-ga/examples.html – px5x2 May 13 '14 at 20:32
  • You'd benefit from using a library like [jsoup](http://jsoup.org/) which makes parsing HTML very easy. Note that you'll not only run into image URLs that are missing a scheme, but you'll also run into **relative** paths, which you'll have to append to the site's URL in order to get. For example, you'll see a path like `/images/srpr/logo9w.png`, which you'd need to append to `"https://www.google.com/"`. – sgbj May 13 '14 at 20:40
  • can you show us some samples of URL placed in your text file. – Braj May 13 '14 at 20:40

3 Answers

3

A URL (a type of URI) requires a scheme in order to be valid. In this case, http.

When you type www.google.com into your browser, the browser is inferring you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.

Make sure you always have http://. You can easily fix this using regex:

String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");

or

if(!stringUrl.startsWith("http://"))
    stringUrl = "http://" + stringUrl;
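Note that the `startsWith("http://")` check would also miss `https://` URLs. If you'd rather not hard-code the prefix, `java.net.URI` reports a missing scheme directly; here's a small sketch (the helper name `ensureHttp` is mine, not from the question):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SchemeCheck {

    // Prepend http:// only when the string has no scheme at all,
    // so https:// and ftp:// URLs pass through untouched.
    public static String ensureHttp(String s) throws URISyntaxException {
        URI uri = new URI(s);
        return (uri.getScheme() == null) ? "http://" + s : s;
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(ensureHttp("www.google.com"));      // http://www.google.com
        System.out.println(ensureHttp("https://google.com/")); // https://google.com/
    }
}
```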
Qix - MONICA WAS MISTREATED
  • Look at the OP post again **I also tried to add the http:// to the img src images which didnt have it, but i still get the same error:** – Braj May 13 '14 at 20:38
  • `The sites where it works the img src starts with http://` and `MalformedURLException: no protocol`. This is clearly his issue. – Qix - MONICA WAS MISTREATED May 13 '14 at 20:42
  • 2
    Hi, I'm on my phone at the moment so I can't check till tomorrow, but thanks for your lightning fast answer. I also tried to add http:// with `if(!parts[i].startsWith("http:")) parts[i] = "http:" + parts[i];` and got the same error – user3634163 May 13 '14 at 20:56
1

An alternative solution

Simply try ImageIO, which contains static convenience methods for locating ImageReaders and ImageWriters and performing simple encoding and decoding.

Sample code:

// read an image from the URL
// I used the URL that is your profile pic on StackOverflow
BufferedImage image = ImageIO
        .read(new URL(
                "https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));

// save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
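One caveat worth knowing: `ImageIO.read` returns `null` when no registered reader understands the bytes, so it pays to check the result before writing. ImageIO also works on arbitrary streams, not just URLs and files; a minimal, network-free round trip (names are mine, just for illustration):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageRoundTrip {
    public static void main(String[] args) throws IOException {
        // a tiny 2x2 RGB image, encoded to JPEG in memory
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "jpg", out);

        // decode it back; read returns null if no reader matches the data
        BufferedImage back = ImageIO.read(new ByteArrayInputStream(out.toByteArray()));
        if (back == null) {
            throw new IOException("no suitable ImageReader found");
        }
        System.out.println(back.getWidth() + "x" + back.getHeight()); // 2x2
    }
}
```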
Braj
0

When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.

Some examples are:

  1. resource = https://google.com/images/srpr/logo9w.png
  2. resource = google.com/images/srpr/logo9w.png
  3. resource = //google.com/images/srpr/logo9w.png
  4. resource = /images/srpr/logo9w.png
  5. resource = images/srpr/logo9w.png

For the second through fifth ones, you'll need to build the rest of the URL.

The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.

The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:

url = site.getProtocol() + ":" + resource

For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.

Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.

import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class SiteScraper {

    public static void main(String[] args) throws IOException {
        URL site = new URL("https://google.com/");
        Document doc = Jsoup.connect(site.toString()).get();
        Elements images = doc.select("img");
        for (Element image : images) {
            String src = image.attr("src");
            System.out.println(buildResourceUrl(site, src));
        }
    }

    static URL buildResourceUrl(URL site, String resource) 
            throws MalformedURLException {
        if (!resource.matches("^(http|https|ftp)://.*$")) {
            if (resource.startsWith("//")) {
                return new URL(site.getProtocol() + ":" + resource);
            } else {
                return new URL(site.getProtocol() + "://" + site.getHost() + "/" 
                        + resource.replaceAll("^/", ""));
            }
        }
        return new URL(resource);
    }
}

This obviously won't cover everything, but it's a start. You may run into problems when the URL you're trying to access is in a subdirectory of the root of the site (e.g., http://some.place/under/the/rainbow.html). You may even encounter base64-encoded data URIs in the src attribute... It really depends on the individual case and how far you're willing to go.
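Incidentally, the subdirectory problem is one the standard library can handle for you: `java.net.URL` has a two-argument constructor, `URL(URL context, String spec)`, that resolves a relative spec against a base URL, including protocol-relative `//host/path` forms. A sketch of how it covers the cases listed above:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveDemo {
    public static void main(String[] args) throws MalformedURLException {
        URL site = new URL("https://google.com/");

        // Absolute path, relative path, protocol-relative, and fully
        // qualified specs all resolve to the same absolute URL here.
        System.out.println(new URL(site, "/images/srpr/logo9w.png"));
        System.out.println(new URL(site, "images/srpr/logo9w.png"));
        System.out.println(new URL(site, "//google.com/images/srpr/logo9w.png"));
        System.out.println(new URL(site, "https://google.com/images/srpr/logo9w.png"));
        // all print: https://google.com/images/srpr/logo9w.png
    }
}
```

It still can't guess a scheme-less `google.com/images/...` (case 2), but it removes the need for the manual host/path concatenation in `buildResourceUrl`.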

sgbj