Wikipedia API
Do you just want to download Wikipedia articles? If so, you can just use Wikipedia's API. I found it in a question here.
A Simple Example
(copy-pasted from the main page)
This URL tells English Wikipedia's web service API to send you the content of the main page:
https://en.wikipedia.org/w/api.php?action=query&titles=Main%20Page&prop=revisions&rvprop=content&format=json
Use any programming language to make an HTTP GET request for that URL (or just visit that link in your browser), and you'll get a JSON document which includes the current wiki markup for the page titled "Main Page". Changing the format to jsonfm will return a "pretty-printed" HTML result good for debugging.
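For instance, here is a minimal Java sketch of that GET request (assuming Java 11+ for the java.net.http client; the class name is just for illustration):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WikipediaApiExample {
    public static void main(String[] args) throws Exception {
        // The example query URL from above, unchanged.
        String url = "https://en.wikipedia.org/w/api.php"
                + "?action=query&titles=Main%20Page"
                + "&prop=revisions&rvprop=content&format=json";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        // The response body is the JSON document described above.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}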
Here is the jsonfm URL as an easier-to-read clickable link.
api.php ? action=query & titles=Main%20Page & prop=revisions & rvprop=content & format=jsonfm
Let's pick that URL apart to show how it works.
The endpoint
https://en.wikipedia.org/w/api.php
This is the endpoint. It's like the home page of the MediaWiki web service API. This URL is the base URL for English Wikipedia's API, just as https://en.wikipedia.org/wiki/ is the base URL for its web site.
If you're writing a program to use English Wikipedia, every URL you construct will begin with this base URL. If you're using a different MediaWiki installation, you'll need to find its endpoint and use that instead. All Wikimedia wikis have endpoints that follow this pattern:
https://en.wikipedia.org/w/api.php # English Wikipedia API
https://nl.wikipedia.org/w/api.php # Dutch Wikipedia API
https://commons.wikimedia.org/w/api.php # Wikimedia Commons API
Since r75621, we have RSD discovery for the endpoint: look for the link rel="EditURI" in the HTML source of any page and extract the api.php URL; the actual link contains additional info. For instance, on this wiki it's:
<link rel="EditURI" type="application/rsd+xml" href="//www.mediawiki.org/w/api.php?action=rsd" />
Otherwise, there's no reliable way to locate the endpoint on an arbitrary wiki. If you're lucky, either the full path to index.php will not be hidden under strange rewrite rules (so you can take the "edit" or history link and replace index.php with api.php), or you'll be able to use the default script path (like w/api.php).
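If you already have Jsoup on the classpath (it is used in the second half of this answer), a rough sketch of that discovery step could look like this; stripping the ?action=rsd suffix to get the bare endpoint is an assumption based on the link shown above:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EndpointDiscovery {
    // Returns the api.php endpoint for a wiki, or null if no EditURI link is found.
    public static String findEndpoint(String anyWikiPageUrl) throws Exception {
        Document doc = Jsoup.connect(anyWikiPageUrl).get();
        Element editUri = doc.select("link[rel=EditURI]").first();
        if (editUri == null) {
            return null; // not a MediaWiki page, or discovery is disabled
        }
        // abs:href resolves protocol-relative hrefs like //www.mediawiki.org/w/api.php?action=rsd
        return editUri.attr("abs:href").replace("?action=rsd", "");
    }
}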
Now let's move on to the parameters in the query string of the URL.
The format
format=json
This tells the API that we want data to be returned in JSON format. You might also want to try format=jsonfm to get an HTML version of the result that is good for debugging. The API supports other output formats such as XML and native PHP, but there are plans to remove less popular formats (phab:T95715), so you might not want to use them.
The action
action=query
The MediaWiki web service API implements dozens of actions and extensions implement many more; the dynamically generated API help documents all available actions on a wiki. In this case, we're using the "query" action to get some information. The "query" action is one of the API's most important actions, and it has extensive documentation of its own. What follows is just an explanation of a single example.
Action-specific parameters
titles=Main%20Page
The rest of the example URL contains parameters used by the "query" action. Here, we're telling the web service API that we want information about the Wiki page called "Main Page". (The %20 comes from percent-encoding a space.) If you need to query multiple pages, put them all in one request to optimize network and server resources: titles=PageA|PageB|PageC. See the query documentation for details.
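Because page titles may contain spaces and other reserved characters, it is safer to percent-encode them than to build the URL by hand. A small sketch (the helper name buildQueryUrl is made up for illustration):
import java.net.URLEncoder;

public class QueryUrlBuilder {
    // Joins one or more titles with "|", as the API expects, and encodes them.
    static String buildQueryUrl(String... titles) throws Exception {
        String joined = String.join("|", titles);
        return "https://en.wikipedia.org/w/api.php?action=query"
                + "&titles=" + URLEncoder.encode(joined, "UTF-8")
                + "&prop=revisions&rvprop=content&format=json";
    }

    public static void main(String[] args) throws Exception {
        // Note: URLEncoder writes spaces as "+", which is also valid in a query string.
        System.out.println(buildQueryUrl("Main Page", "Wikipedia"));
    }
}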
prop=revisions
You can request many kinds of information, or properties, about a page. This parameter tells the web service API that we want information about a particular revision of the page. Since we're not specifying any revision information, the API will give us information about the latest revision — the main page of Wikipedia as it stands right now.
rvprop=content
Finally, this parameter tells the web service API that we want the content of the latest revision of the page. If we passed in rvprop=content|user instead, we'd get the latest page content and the name of the user who made the most recent revision.
Again, this is just one example. Queries are explained in more detail here, and the API reference lists all the possible actions, all the possible values for rvprop, and so on.
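To actually pull the wiki markup out of the JSON response you will need a JSON parser. A sketch using the org.json library (any JSON library works; the nesting of "query", "pages", and "revisions" below matches the example response, where the content sits under the "*" key):
import org.json.JSONObject;

public class ParseRevision {
    // json is the response body from the example query URL above.
    static String extractContent(String json) {
        JSONObject pages = new JSONObject(json)
                .getJSONObject("query")
                .getJSONObject("pages");
        // "pages" is keyed by page ID, so take the first (and here only) key.
        String pageId = pages.keys().next();
        return pages.getJSONObject(pageId)
                .getJSONArray("revisions")
                .getJSONObject(0)
                .getString("*"); // the wiki markup of the latest revision
    }
}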
More Generic Approach
This is a more generic approach: use Jsoup to scrape the HTML.
First, find a way to collect all the URLs you want to scrape, perhaps in a String array or a List. Then iterate through them like so:
for (String url : urls) {
downloadAllFilesOnURL(url);
}
Create a method downloadAllFilesOnURL(String url). It takes a URL as a String parameter. Then connect to it using Jsoup:
Document doc = Jsoup.connect(url)
        .timeout(60 * 1000) // 60 seconds
        .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) "
                + "AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/33.0.1750.152 Safari/537.36")
        .get();
Then write the doc object to a file:
PrintWriter pen = new PrintWriter(<somefilehere>);
pen.println(doc.toString());
pen.close();
Next, get all the links on that page so we can crawl them too. Here we recursively call our method downloadAllFilesOnURL(String url), like so:
Elements anchorElements = doc.select("a");
for(Element anchor : anchorElements) {
downloadAllFilesOnURL(anchor.attr("abs:href"));
}
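One caveat: pages on a site link back to each other, so this recursion will loop forever unless you remember which URLs you have already visited. A minimal sketch of that guard (the class name is made up; the method mirrors the hypothetical one above):
import java.util.HashSet;
import java.util.Set;

public class Crawler {
    // Remembers every URL we have already processed so the recursion terminates.
    private final Set<String> visited = new HashSet<>();

    public void downloadAllFilesOnURL(String url) {
        // Set.add returns false if the URL was already present.
        if (url == null || url.isEmpty() || !visited.add(url)) {
            return;
        }
        // ... connect with Jsoup, save the page, then recurse into its links as above
    }
}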
As for images and other files, the logic is the same.
Elements imageElements = doc.select("img");
for (Element image : imageElements) {
    downloadFile(image.attr("abs:src"));
}
Here we declare a method downloadFile(String url). This method downloads a file from a URL like http://media.example.com/articleA/image2.png:
Connection.Response response = Jsoup.connect(url)
        .timeout(60 * 1000)
        .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) "
                + "AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/33.0.1750.152 Safari/537.36")
        .ignoreContentType(true) // allow non-HTML content such as images
        .execute();
byte[] bytes = response.bodyAsBytes();
// this is how you write the file
try (FileOutputStream outputStream = new FileOutputStream(<somefilehere>)) {
    outputStream.write(bytes);
} catch (IOException e) {
    e.printStackTrace(); // don't swallow the error silently
}
This is just a guide on how to do this. If I were the one assigned this task, this would be my approach.