
There are ways to download an entire webpage using HTMLEditorKit. However, I need to download a webpage that requires scrolling in order to load its full content. This behavior is most commonly implemented with JavaScript and Ajax (infinite scrolling).

Q1: Is there a way to trick the target webpage, using only Java code, into serving its full content?

Q2: If this is not possible with Java alone, is it possible in combination with JavaScript?

For reference, here is what I wrote:

import java.awt.image.BufferedImage;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import javax.imageio.ImageIO;
import javax.swing.text.AttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class PageDownload {

    public static void main(String[] args) throws Exception {
        String webUrl = "...";
        URL url = new URL(webUrl);
        URLConnection connection = url.openConnection();
        InputStream is = connection.getInputStream();
        InputStreamReader isr = new InputStreamReader(is);
        BufferedReader br = new BufferedReader(isr);

        // parse the downloaded HTML with Swing's HTMLEditorKit
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        HTMLEditorKit.Parser parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
        parser.parse(br, callback, true);

        // walk every <img> tag and download sources with known image extensions
        for (HTMLDocument.Iterator iterator = htmlDoc.getIterator(HTML.Tag.IMG);
                iterator.isValid(); iterator.next()) {
            AttributeSet attributes = iterator.getAttributes();
            String imgSrc = (String) attributes.getAttribute(HTML.Attribute.SRC);
            if (imgSrc != null && (imgSrc.endsWith(".jpg") || (imgSrc.endsWith(".jpeg"))
                    || (imgSrc.endsWith(".png")) || (imgSrc.endsWith(".ico"))
                    || (imgSrc.endsWith(".bmp")))) {
                try {
                    downloadImage(webUrl, imgSrc);
                } catch (IOException ex) {
                    System.out.println(ex.getMessage());
                }
            }
        }

    }

    private static void downloadImage(String url, String imgSrc) throws IOException {
        BufferedImage image = null;
        try {
            // resolve relative image sources against the page URL
            if (!imgSrc.startsWith("http")) {
                url = url + imgSrc;
            } else {
                url = imgSrc;
            }
            imgSrc = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
            String imageFormat = imgSrc.substring(imgSrc.lastIndexOf(".") + 1);
            String imgPath = "..." + imgSrc;
            URL imageUrl = new URL(url);
            image = ImageIO.read(imageUrl);
            if (image != null) {
                File file = new File(imgPath);
                ImageIO.write(image, imageFormat, file);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

}
– Insanovation

4 Answers


Use the HtmlUnit library to get all the text and the image/CSS files.

HtmlUnit: htmlunit.sourceforge.net

1) To download text content, use the code in the links below:

all text content: How to get a HTML page using HtmlUnit

a specific tag such as span: how to get text between a specific span with HtmlUnit

2) To get images/files, see: How can I tell HtmlUnit's WebClient to download images and css?
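
For completeness, here is a minimal sketch of loading a page with JavaScript enabled and waiting for background Ajax requests to finish. The URL and timeout are placeholders, and it assumes an HtmlUnit 2.x version where WebClient is AutoCloseable:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitDownload {

    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // load the page, then give background Ajax requests time to finish
            HtmlPage page = webClient.getPage("http://example.com"); // placeholder URL
            webClient.waitForBackgroundJavaScript(10000); // placeholder timeout, in ms

            // asXml() returns the DOM as it stands after the scripts have run
            System.out.println(page.asXml());
        }
    }
}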

– Vishvesh Phadnis

Yes, you can trick a webpage into downloading to your local machine with Java code. You cannot download static HTML content with JavaScript alone; browser-side JavaScript does not provide a way to create files the way Java does.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;


public class HttpDownloadUtility {
    private static final int BUFFER_SIZE = 4096;

    /**
     * Downloads a file from a URL
     * @param fileURL HTTP URL of the file to be downloaded
     * @param saveDir path of the directory to save the file
     * @throws IOException
     */
    public static void downloadFile(String fileURL, String saveDir)
            throws IOException {
        URL url = new URL(fileURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        int responseCode = httpConn.getResponseCode();

        // always check HTTP response code first
        if (responseCode == HttpURLConnection.HTTP_OK) {
            String fileName = "";
            String disposition = httpConn.getHeaderField("Content-Disposition");
            String contentType = httpConn.getContentType();
            int contentLength = httpConn.getContentLength();

            if (disposition != null) {
                // extracts the file name from the Content-Disposition header
                // (assumes the common form: attachment; filename="name.ext")
                int index = disposition.indexOf("filename=");
                if (index >= 0) {
                    fileName = disposition.substring(index + 10,
                            disposition.length() - 1);
                }
            } else {
                // falls back to the last path segment of the URL
                fileName = fileURL.substring(fileURL.lastIndexOf("/") + 1);
            }

            System.out.println("Content-Type = " + contentType);
            System.out.println("Content-Disposition = " + disposition);
            System.out.println("Content-Length = " + contentLength);
            System.out.println("fileName = " + fileName);

            // opens input stream from the HTTP connection
            InputStream inputStream = httpConn.getInputStream();
            String saveFilePath = saveDir + File.separator + fileName;

            // opens an output stream to save into file
            FileOutputStream outputStream = new FileOutputStream(saveFilePath);

            int bytesRead = -1;
            byte[] buffer = new byte[BUFFER_SIZE];
            while ((bytesRead = inputStream.read(buffer)) != -1) {
                outputStream.write(buffer, 0, bytesRead);
            }

            outputStream.close();
            inputStream.close();

            System.out.println("File downloaded");
        } else {
            System.out.println("No file to download. Server replied HTTP code: " + responseCode);
        }
        httpConn.disconnect();
    }
}
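
A hypothetical call site (the URL and save directory are placeholders):

HttpDownloadUtility.downloadFile("http://example.com/photo.jpg", "downloads");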
– UtkarshBhavsar
  • Insanovation, am I making sense to you for your question? – UtkarshBhavsar Oct 27 '14 at 10:50
  • I'm really busy working on something else right now, but I'll get back to this subject as soon as I can (in 7 hours). Your help will be rewarded, right after I'll study your proposed solution. Thank you for your understanding. – Insanovation Oct 27 '14 at 13:07
  • Great, it worked. However, I tested it on 9gag.com and it didn't download the entire content. If you scroll through 9gag for about 30 seconds, you reach the bottom of the page. Until then, there are a lot of images, and their .jpg or .gif endings are not present in the downloaded file produced by your code. I assume your way may be the only one exposed here... If no more effective code is posted, the bounty will go to you. Thank you. – Insanovation Oct 27 '14 at 20:58
  • There is software that provides the facility to download an entire page with CSS, JS, images, and fonts, but if you are using a Java program then you can download only the content served at the URL (here, the HTML code only). – UtkarshBhavsar Oct 28 '14 at 04:18

You can achieve this with the Selenium WebDriver Java classes:

https://code.google.com/p/selenium/wiki/GettingStarted

Generally, WebDriver is used for testing, but it can emulate a user scrolling down the page until the page stops changing; then you can use Java code to save the content to a file.
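
A minimal sketch of that approach (the URL, sleep interval, and output file name are placeholders; it assumes Selenium WebDriver with Firefox on the classpath):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class ScrollAndSave {

    public static void main(String[] args) throws Exception {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://example.com"); // placeholder URL
            JavascriptExecutor js = (JavascriptExecutor) driver;

            // keep scrolling until the page height stops growing
            long lastHeight = (Long) js.executeScript("return document.body.scrollHeight;");
            while (true) {
                js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
                Thread.sleep(2000); // placeholder: give Ajax content time to load
                long newHeight = (Long) js.executeScript("return document.body.scrollHeight;");
                if (newHeight == lastHeight) {
                    break;
                }
                lastHeight = newHeight;
            }

            // save the fully loaded markup to a file
            Files.write(Paths.get("page.html"),
                    driver.getPageSource().getBytes(StandardCharsets.UTF_8));
        } finally {
            driver.quit();
        }
    }
}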

– CharlieS

You can do it using IDM's grabber.

This should help: https://www.internetdownloadmanager.com/support/idm-grabber/grabber_wizard.html