
Right now I am trying to scrape a web page, but I also want to save the static files, so that when I try to reproduce the website the result is as close as possible to the original.

For example, if you go to any Wikipedia page and click "Save As", the browser downloads the HTML page plus a folder with images and some CSS or PHP files. That folder is my focus.

I searched the internet but couldn't find anything that helped. So far I have a function that takes a list of URLs and saves the HTML locally, but again, I want the static files too.

private List<FileRelation> scrapUrls(List<String> urlsToScrap, String jobPath) throws IOException {
    List<FileRelation> files = new ArrayList<>();
    Document doc;
    for (String url : urlsToScrap) {
        try {
            doc = Jsoup.connect(url)
                       .userAgent("Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.7.3) Gecko/20040924 Epiphany/1.4.4 (Ubuntu)")
                       .timeout(5000)
                       .get();
            String fileName = UUID.randomUUID() + ".html";
            // try-with-resources so the writer is flushed and closed even if writing fails
            try (BufferedWriter writer =
                     new BufferedWriter(
                         new FileWriter(jobPath + File.separator + fileName))) {
                writer.write(doc.html());
            }

            FileRelation file = new FileRelation(fileName, jobPath);
            files.add(file);
        } catch (Exception e) {
            System.out.println("ScrapingService - " + e);
        }
    }
    return files;
}
  • I am not sure that jsoup is the right tool for this task. Why don't you use *Selenium* that lets you access the browser functionality from Java? See also [Java Webdriver: How to save the page same as "save page as" in firefox?](https://stackoverflow.com/q/24857931/6367213). – Janez Kuhar Sep 16 '21 at 08:12
  • +1 - but if you did particularly want to just use jsoup, the approach would be to fetch the page like you are now, then find every resource in it. A selector like "[src], [href]" will find all elements with one of those attributes. Then add those attributes (see absUrl) to a fetch queue, and get and save those. You would miss indirect resources though (like images loaded from a CSS file, or fetched via javascript, etc). Might be useful for simple sites though. – Jonathan Hedley Sep 21 '21 at 12:05
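For what it's worth, here is a minimal sketch of the jsoup-only approach described in the comment above, under a few assumptions: downloadResources is a hypothetical helper name of mine, the filename handling is a naive placeholder (no collision handling), and indirect resources (images referenced from CSS, things loaded by JavaScript) are not covered. It would be called right after the Jsoup.connect(...).get() line inside scrapUrls, e.g. downloadResources(doc, jobPath).

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.UUID;

// Hypothetical helper: download everything the page references via a
// src or href attribute and save it next to the HTML file in jobPath.
private void downloadResources(Document doc, String jobPath) {
    // "[src], [href]" selects every element carrying either attribute
    for (Element el : doc.select("[src], [href]")) {
        // absUrl resolves the attribute value against the page's base URI
        String resourceUrl = el.hasAttr("src") ? el.absUrl("src") : el.absUrl("href");
        if (resourceUrl.isEmpty() || !resourceUrl.startsWith("http")) {
            continue; // skip mailto:, javascript:, fragment-only links, etc.
        }
        try {
            // ignoreContentType(true) lets jsoup fetch non-HTML bodies (images, CSS, JS)
            byte[] body = Jsoup.connect(resourceUrl)
                               .ignoreContentType(true)
                               .maxBodySize(0)   // no size cap on the response body
                               .timeout(5000)
                               .execute()
                               .bodyAsBytes();
            // naive filename: last path segment of the URL, or a random name as fallback
            String name = resourceUrl.substring(resourceUrl.lastIndexOf('/') + 1);
            if (name.isEmpty()) {
                name = UUID.randomUUID() + ".bin";
            }
            Files.write(Paths.get(jobPath, name), body);
        } catch (Exception e) {
            System.out.println("ScrapingService - could not fetch " + resourceUrl + ": " + e);
        }
    }
}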

0 Answers