How do you Programmatically Download a Webpage in Java

Question

I would like to fetch a web page's HTML and save it to a String so I can do some processing on it. Also, how can I handle the various types of compression?

How would I go about doing that using Java?

This is basically a special case of https://stackoverflow.com/questions/921262/how-to-download-and-save-a-file-from-internet-using-java — Robin Green, Nov 17 '18 at 06:29

score 184 · Answer 1 · edited Jun 20 '20 at 09:12

I'd use a decent HTML parser like Jsoup. It's then as easy as:

String html = Jsoup.connect("http://stackoverflow.com").get().html();

It handles GZIP and chunked responses and character encoding fully transparently. It offers further advantages as well, such as HTML traversal and manipulation with CSS selectors, much as jQuery does. You only have to grab it as a Document, not as a String.

Document document = Jsoup.connect("http://google.com").get();

You really don't want to run basic String methods or even regex on HTML to process it.

Good answer. A little late. `;)` – jjnguy Jan 01 '11 at 00:00
Why did no one tell me about .html() before? I looked so hard into how to easily store the HTML fetched by Jsoup, and that helps a lot. – Avamander Jul 14 '16 at 20:17
For newcomers: if you use this library on Android, you need to call it from a separate thread, because by default it runs on the main application thread, which causes the application to throw `NetworkOnMainThreadException`. – Mohammed Elrashied Jul 18 '18 at 12:23

score 117 · Accepted Answer · edited Aug 20 '13 at 21:08

Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions or passing them up the call stack, though.

public static void main(String[] args) {
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;

    try {
        url = new URL("http://stackoverflow.com/");
        is = url.openStream();  // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));

        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
         mue.printStackTrace();
    } catch (IOException ioe) {
         ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
}

answered Oct 26 '08 at 21:09 by Bill the Lizard · edited Aug 20 '13 at 21:08 by Stas Yak

DataInputStream.readLine() is deprecated, but other than that very good example. I used an InputStreamReader() wrapped in a BufferedReader() to get the readLine() function. – mjh2007 Feb 02 '10 at 14:44
This doesn't take character encoding into account, so while it'll appear to work for ASCII text, it will eventually result in 'strange characters' when there's a mismatch. – artbristol Jul 22 '12 at 08:25
In the 3rd line, replace `DataInputStream` with `BufferedReader`. And replace `"dis = new DataInputStream(new BufferedInputStream(is));"` with `"dis = new BufferedReader(new InputStreamReader(is));"` – kolobok Apr 18 '13 at 14:17
@akapelko Thank you. I updated my answer to remove the calls to deprecated methods. – Bill the Lizard Apr 18 '13 at 14:32
what about closing the `InputStreamReader`? – Alexander Dec 03 '16 at 00:39
if you need to get all lines together use StringBuilder append("line") method instead of System.out.println(line); - it will be the most efficient way to put together all lines – Kirill Karmazin Jul 11 '17 at 12:34
This is not closing its socket. – Andrew Dec 14 '18 at 16:31
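
Pulling together the points raised in the comments above (explicit character encoding, closing the reader, collecting the lines in a StringBuilder), a minimal sketch might look like the following. The class name `PageReader` and the helper `readAll` are made up for illustration, and the in-memory stream in `main` is only a stand-in for `url.openStream()` so the example runs without network access. It assumes the caller already knows the page's charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class PageReader {

    /** Reads the whole stream into a String, decoding it with the given charset. */
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Closing the BufferedReader also closes the wrapped InputStream.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(); the bytes are UTF-8 encoded HTML.
        byte[] page = "<html>\n<body>caf\u00e9</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

With a real page, the call would presumably be `readAll(url.openStream(), StandardCharsets.UTF_8)`, with the charset replaced by whatever the server actually declares.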

jjnguy · Answer 3 · 2010-04-06T05:23:37.880

Bill's answer is very good, but you may want to do some things with the request, like handling compression or setting a user-agent. The following code shows how you can add various types of compression to your requests.

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(),
      new Inflater(true));
} else {
    inStr = conn.getInputStream();
}

To also set the user-agent, add the following line before the response is read (request properties cannot be changed once the connection is open):

conn.setRequestProperty ( "User-agent", "my agent name");

For those looking to convert the InputStream to string, see [this answer](https://stackoverflow.com/a/3627441). — SE Does Not Like Dissent, Aug 22 '19 at 15:40
setFollowRedirects helps, I use setInstanceFollowRedirects in my case, I was getting empty web pages in many cases before using that. I assume that you try to use compression to download the file faster. — gouessej, Aug 07 '20 at 23:17
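
As a rough, self-contained sketch of the same idea, the request setup and the stream wrapping can be split into two small helpers. The class name `EncodingAwareFetch` and the methods `prepare` and `decode` are made up for illustration. Unlike the code above, the `deflate` branch here assumes zlib-wrapped data (the default `InflaterInputStream`), whereas the answer uses `new Inflater(true)` for raw deflate streams; which one a given server sends is something to check. `readAllBytes()` assumes Java 9 or later. The `main` method gzips a small body in memory so it runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper class, not part of the original answer.
class EncodingAwareFetch {

    /** Creates a connection object (not yet connected) with the request headers set. */
    public static HttpURLConnection prepare(URL url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);
        // Headers must be set before the connection is actually opened.
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    /** Wraps the raw body stream according to the Content-Encoding response header. */
    public static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        if ("deflate".equalsIgnoreCase(contentEncoding)) {
            return new InflaterInputStream(raw); // assumption: zlib-wrapped data
        }
        return raw; // header absent or unrecognised: pass the stream through
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration: gzip a small body, then decode it again.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = decode(new ByteArrayInputStream(buf.toByteArray()), "gzip")) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Against a real server the two helpers would be combined as `decode(conn.getInputStream(), conn.getContentEncoding())` on the connection returned by `prepare`.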

score 13 · Answer 4 · edited Nov 11 '14 at 18:56

Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give very much control.

~~Personally I'd go with the Apache HTTPClient library.~~
Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components

answered Oct 26 '08 at 20:20 by Jon Skeet · edited Nov 11 '14 at 18:56 by rogerdpack

There is no java version of System.Net.WebRequest? – FlySwat Oct 26 '08 at 20:54
Sort of, that would be URL. :-) For example: new URL("http://www.google.com").openStream() // => InputStream – Daniel Spiewak Oct 26 '08 at 21:01
@Jonathan: What Daniel said, for the most part - although WebRequest gives you more control than URL. HTTPClient is closer in functionality, IMO. – Jon Skeet Oct 26 '08 at 21:05

score 9 · Answer 5 · answered May 30 '14 at 10:30

None of the approaches mentioned above download the web page text as it looks in the browser. These days a lot of data is loaded into the browser by scripts in the HTML page. None of the techniques above support scripts; they download only the HTML text. HtmlUnit supports JavaScript, so if you want to download the web page text as it looks in the browser, you should use HtmlUnit.

Supercoder · Answer 6 · 2019-06-04T17:06:33.800

You will most likely need to fetch the source of a secure web page (HTTPS protocol). In the following example, the HTML is saved to c:\temp\filename.html. Enjoy!

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url </b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html";
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty ( "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0" );
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);
        String inputLine;

        // Write each line into the file
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
            bw.newLine(); // readLine() strips the line terminator, so add it back
        }
        in.close(); 
        bw.close();
    }
}
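
The example above hard-codes `Windows-1252` as the character encoding. One possible alternative, sketched here under assumptions, is to take the charset from the `Content-Type` response header and fall back to a default when none is declared. The class name `ContentTypeCharset` and the helper `charsetOf` are made up for illustration, and the sketch ignores any charset declared inside the HTML itself (for example in a `<meta>` tag).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class ContentTypeCharset {

    /** Returns the charset named in a Content-Type header value, or the fallback. */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // A header value looks like: text/html; charset=UTF-8
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

In the example above this could be used as `new InputStreamReader(ins, ContentTypeCharset.charsetOf(con.getContentType(), StandardCharsets.UTF_8))`.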

score 1 · Answer 7 · answered Jun 15 '20 at 19:23

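To do so using NIO.2's powerful Files.copy(InputStream in, Path target):

URL url = new URL( "http://download.me/" );
// try-with-resources closes the stream once the copy has finished
try ( InputStream in = url.openStream() ) {
    Files.copy( in, Paths.get( "downloaded.html" ) );
}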
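
One detail worth knowing: without extra options, `Files.copy` throws `FileAlreadyExistsException` if the target file already exists. A minimal sketch that overwrites the target instead is shown below. The class name `SaveStream` and the helper `save` are made up for illustration, and the in-memory stream in `main` stands in for `url.openStream()` so the example runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, not part of the original answer.
class SaveStream {

    /** Copies the stream to target, replacing any existing file; returns the bytes written. */
    public static long save(InputStream source, Path target) throws IOException {
        // try-with-resources closes the source stream after the copy.
        try (InputStream in = source) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(), so the sketch runs without network access.
        InputStream fake = new ByteArrayInputStream("<html></html>".getBytes(StandardCharsets.UTF_8));
        // createTempFile already creates the file, so REPLACE_EXISTING is needed here.
        Path target = Files.createTempFile("downloaded", ".html");
        System.out.println(save(fake, target) + " bytes written to " + target);
    }
}
```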

answered Jun 15 '20 at 19:23 by Jan Tibar

score 0 · Answer 8 · answered Oct 26 '08 at 20:43

On a Unix/Linux box you could just run 'wget' but this is not really an option if you're writing a cross-platform client. Of course this assumes that you don't really want to do much with the data you download between the point of downloading it and it hitting the disk.

answered Oct 26 '08 at 20:43 by Timo Geusch

I would also start with this approach and refactor it later if it proves insufficient. – Dustin Getz Oct 03 '09 at 19:55

Jan Bodnar · Answer 9 · 2021-05-01T17:35:44.607

Jetty has an HTTP client which can be used to download a web page.

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();
            
            String url = "http://example.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}

The example prints the contents of a simple web page.

In the "Reading a web page in Java" tutorial, I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.

score 0 · Answer 10 · edited Oct 12 '18 at 04:35

This class may help: it downloads the page source and filters out some information.

public class MainActivity extends AppCompatActivity {

    EditText url;
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate( savedInstanceState );
        setContentView( R.layout.activity_main );

        url = ((EditText)findViewById( R.id.editText));
        DownloadCode obj = new DownloadCode();

        try {
            String des=" ";

            String tag1= "<div class=\"description\">";
            String l = obj.execute( "http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty" ).get();

            url.setText( l );
            url.setText( " " );

            // t1[1] is the text after the first opening tag (t1[0] is what precedes it)
            String[] t1 = l.split(tag1);
            String[] t2 = t1[1].split( "</div>" );
            url.setText( t2[0] );

        }
        catch (Exception e)
        {
            Toast.makeText( this,e.toString(),Toast.LENGTH_SHORT ).show();
        }

    }
                                        // input, extrafunctionrunparallel, output
    class DownloadCode extends AsyncTask<String,Void,String>
    {
        @Override
        protected String doInBackground(String... WebAddress) // string of webAddress separate by ','
        {
            String htmlcontent = " ";
            try {
                URL url = new URL( WebAddress[0] );
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                int data;
                InputStreamReader reader = new InputStreamReader( input );

                data = reader.read();

                while (data != -1)
                {
                    char content = (char) data;
                    htmlcontent+=content;
                    data = reader.read();
                }
            }
            catch (Exception e)
            {
                Log.i("Status : ",e.toString());
            }
            return htmlcontent;
        }
    }
}

A_01 · Answer 11 · 2018-07-27T12:42:29.493

I used the accepted answer to this post (the URL approach) and wrote the output to a file.

package test;

import java.net.*;
import java.io.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.fetagracollege.org");
        String fileName = "D:\\a_01\\output.txt";

        // try-with-resources closes (and flushes) both the reader and the writer
        try (BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
             PrintWriter writer = new PrintWriter(fileName, "UTF-8")) {
            String inputLine;

            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                writer.println(inputLine);
            }
        }
    }
}
