HTML DOM to Download Image from URI

Question

I have created a list of all the page URIs of a vehicle service manual that I'd like to download an image from.

The images are delivered via a PHP script, as can be seen here: http://www.atfinley.com/service/index.php?cat=g2&page=32

This is probably meant to deter behavior like my own. However, Acura Legend owners shouldn't all have to depend on a single host for their vehicle's manual.

I'd like to design a bot in JS/Java that visits every URL I've stored in this txt document, https://pastebin.com/yXdMJipq, and automates the download of the PNG available at each one.

I'll eventually be creating a pdf of the manual, and publishing it for open and free use.

If anyone has ideas for libraries I could use, or ways to delve into the solution, please let me know. I am most fluent in Java.

I'm thinking a solution might be to fetch the HTML document at each URL and download the image referenced by the src attribute of its <img> element.
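The idea in the paragraph above can be sketched with nothing but the Java standard library. This is a rough illustration, not a robust parser: the class and method names (`FirstImage`, `firstImageUrl`) and the sample markup are made up for the example, and the regular expression assumes a quoted src attribute. A real HTML parser would cope better with unusual markup.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstImage {
    // Matches the src attribute of an <img> tag, quoted with " or '.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the absolute URL of the first <img> in html, if there is one. */
    static Optional<String> firstImageUrl(String html, String pageUrl) {
        Matcher m = IMG_SRC.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Resolve a relative src against the URL of the page it was found on.
        return Optional.of(URI.create(pageUrl).resolve(m.group(2)).toString());
    }

    public static void main(String[] args) {
        // Hypothetical markup: the real page's structure is an assumption here.
        String html = "<html><body><img src=\"pages/g2_32.png\"></body></html>";
        String page = "http://www.atfinley.com/service/index.php?cat=g2&page=32";
        System.out.println(firstImageUrl(html, page).orElse("no image found"));
    }
}
```

Resolving the src against the page URL matters because the attribute is often a relative path, which cannot be downloaded on its own.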

There is a `print/save` button. You should get the links for your imgs from there. – Penguin9, Jul 24 '17 at 09:28

score 1 · Answer 1 · answered Jul 24 '17 at 09:41

I know you asked for a JavaScript solution but I believe PHP (which you also added as a tag) is more suitable for the task. Here are some guidelines to get you started:

1. Move all the URLs into an array and create a foreach loop that will iterate over it.
2. Inside the loop, use the PHP Simple HTML DOM Parser to retrieve the image URL attribute for each page.
3. Still inside the loop, use the image URL in a cURL request to grab the file and save it into your custom folder. You can find the code required for this part here.

If this process takes too long and you get a PHP runtime error, consider storing the URLs generated by step 2 in a file, then using that file to build a new array and running step 3 on it as a separate process.

Thanks for the response. I am a bit more familiar with Java, and PHP wasn't configured on my machine. However, I did enjoy studying your solution; it has very little overhead. – Daniel Winston, Jul 25 '17 at 07:55

score 1 · Accepted Answer · tuberains · 2017-07-24T13:35:05.537

I have written something similar but unfortunately, I can't find it anymore. Nevertheless, I remember using the jsoup Java library, which comes in pretty handy.

It includes an HTTP client and you can run CSS selectors on the document just like with jQuery...

This is the example from their front page:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Creating PDFs is quite tricky, but I use Apache PDFBox for such things...

edited Jul 24 '17 at 13:35

answered Jul 24 '17 at 09:47

Extremely useful for me, thank you so much, was able to find the CSS Selector in Inspector. Literally a copy paste solution to parse the html from the document. – Daniel Winston Jul 25 '17 at 07:54

score 0 · Answer 3 · answered Jul 25 '17 at 09:06

Finished solution for grabbing the image URLs:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Acura {

    // Number of page URLs in the input list; used only for the progress output.
    private static final int TOTAL = 2690;

    public static void main(String[] args) throws IOException {
        // try-with-resources closes both files even if an exception is thrown.
        try (Scanner read = new Scanner(new File("F:/result.txt"));
             Writer write = new FileWriter("F:/imgurls.txt")) {
            int s = 0;

            while (read.hasNextLine()) {
                s++;
                String url = read.nextLine();
                try {
                    Document doc = Jsoup.connect(url).get();
                    // Take the first <img> element on the page.
                    Element img = doc.select("img").first();
                    if (img == null) {
                        System.err.println("No <img> found at " + url);
                        continue;
                    }
                    // absUrl resolves a relative src against the page URL.
                    write.write(img.absUrl("src") + "\n");
                    // Multiply by 100 so the printed value really is a percentage.
                    System.out.println((s * 100.0 / TOTAL) + "%");
                } catch (IOException e) {
                    // Log the failure and move on to the next page.
                    e.printStackTrace();
                }
            }
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        }
    }
}

This generates a nice long list of image URLs in a text document. I could have done it in a non-sequential manner, but was heavily intoxicated when I did this. However, I did add a progress readout for my own peace of mind :)
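For anyone who wants the non-sequential version, here is a rough sketch using a fixed thread pool from the standard library. The names `ParallelFetch` and `mapInOrder` are made up for this example, and the `fetch` function is a placeholder for whatever turns a page URL into an image URL (for instance the jsoup lookup above). Results come back in the same order as the input, so the page order of the manual is preserved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelFetch {

    /** Applies fetch to every URL on a fixed-size pool and returns the results in input order. */
    static List<String> mapInOrder(List<String> urls, Function<String, String> fetch, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the tasks can run concurrently.
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetch.apply(url);
                pending.add(pool.submit(task));
            }
            // Collect in submission order, which keeps the pages in sequence.
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A small pool (say 4 threads) is probably the sensible choice here, since all 2690 requests go to the same host.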

score 0 · Answer 4 · answered Jul 25 '17 at 09:27

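import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Scanner;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

// The class name is illustrative; only the method body was posted originally.
public class AcuraDownload {

    // Number of image URLs in the list; used only for the progress output.
    private static final int TOTAL = 2690;

    public static void main(String[] args) {
        try (Scanner read = new Scanner(new File("F:/imgurls.txt"))) {
            int s = 0;

            while (read.hasNextLine()) {
                s++;
                String url = read.nextLine();
                try {
                    // ignoreContentType(true) lets jsoup fetch non-HTML bodies such as PNGs.
                    Response imageResponse = Jsoup.connect(url).ignoreContentType(true).execute();
                    // Name each file after its position in the list: 1.png, 2.png, ...
                    try (FileOutputStream writer = new FileOutputStream(new File("F:/Acura/" + s + ".png"))) {
                        writer.write(imageResponse.bodyAsBytes());
                    }
                    // Multiply by 100 so the printed value really is a percentage.
                    System.out.println((s * 100.0 / TOTAL) + "%");
                } catch (IOException e) {
                    // Log the failure and move on to the next image.
                    e.printStackTrace();
                }
            }
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        }
    }
}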

Worked for generating pngs
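One thing to watch when the PNGs are later assembled into a PDF: names like 1.png, 10.png, 2.png do not sort in page order alphabetically. A possible tweak, sketched below with the standard library only, is to zero-pad the file names. The class and method names (`PageFiles`, `fileName`, `save`, `download`) are invented for this example, and `download` is an untested alternative to the jsoup call above, not a replacement the original author used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class PageFiles {

    /** Builds a zero-padded name such as 0007.png so files sort in page order. */
    static String fileName(int index, int total) {
        int width = String.valueOf(total).length();
        return String.format("%0" + width + "d.png", index);
    }

    /** Writes the stream to target, creating missing parent directories first. */
    static void save(InputStream in, Path target) throws IOException {
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Downloads imageUrl straight to disk without holding the whole body in memory. */
    static void download(String imageUrl, Path target) throws IOException {
        try (InputStream in = URI.create(imageUrl).toURL().openStream()) {
            save(in, target);
        }
    }
}
```

With 2690 pages, `PageFiles.fileName(7, 2690)` gives a four-digit name, and `Path.of("F:/Acura", PageFiles.fileName(s, 2690))` could be passed as the target inside the loop.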
