
Requirement:

Read the HTML from any website, say "http://www.twitter.com".

Print the retrieved HTML.

Save it to a text file on the local machine.

Code:

import java.net.*;
import java.io.*;

public class oddless {
    public static void main(String[] args) throws Exception {

        URL oracle = new URL("http://www.fetagracollege.org");
        BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));

        OutputStream os = new FileOutputStream("/Users/Rohan/new_sourcee.txt");

        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            os.write((inputLine + "\n").getBytes()); // write each line to the file as well
        }
        in.close(); // note: os is never flushed or closed
    }
}

The code above retrieves the data, prints it to the console, and saves it to a text file, but it mostly retrieves only half the code (apparently because of a blank line in the HTML source). It does not save the code beyond that point.

Questions:

How can I save the full HTML code?

Are there any other alternatives?

  • Don't close the InputStream until you're finished reading it. Make sure you flush (if required) and close the OutputStream when you're done with it. All of this should be done within a try-catch-finally block (see the sketch after these comments) – MadProgrammer Mar 22 '14 at 07:32
  • Try Apache's Commons IO; it's great for copying entire streams and has been well tested. I've used the library in ~70% of my Android and Java SE projects and it has worked great. You can find it here: http://commons.apache.org/proper/commons-io/ – lucian.pantelimon Mar 22 '14 at 07:40
  • @gstack Have you reviewed the answers? – Leos Literak Mar 23 '14 at 10:19
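
Putting the two comments above together, a minimal sketch of the suggested fix might look like this. The class name SaveHtml is mine, it assumes the commons-io jar is on the classpath, and it reuses the URL and file path from the question:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.commons.io.IOUtils;

public class SaveHtml {
    public static void main(String[] args) throws Exception {
        InputStream in = null;
        OutputStream out = null;
        try {
            in = new URL("http://www.fetagracollege.org").openStream();
            out = new FileOutputStream("/Users/Rohan/new_sourcee.txt");
            IOUtils.copy(in, out); // copies the entire stream, not line by line
            out.flush();           // flush before closing, as suggested above
        } finally {
            IOUtils.closeQuietly(in);  // close the input only after reading everything
            IOUtils.closeQuietly(out); // close the output once writing is done
        }
    }
}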

3 Answers


I used a different approach but received the same output as you. Isn't there a problem on the server side of this URL?

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("http://www.fetagracollege.org");
CloseableHttpResponse response1 = httpclient.execute(httpGet);
try {
    System.out.println(response1.getStatusLine());
    HttpEntity entity1 = response1.getEntity();
    String content = EntityUtils.toString(entity1); // read the whole entity into a String
    System.out.println(content);
} finally {
    response1.close();
}

It finishes with:

    </table>
    <p><br>

UPDATE: This Faculty of Engineering and Technology does not have a well-formed home page. The content is complete and your code works fine. But the commenters are right: you should use a try/catch/finally block.
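
Applied to the snippet above, that advice could look like the following sketch (assuming HttpClient 4.3+, and also closing the client itself):

CloseableHttpClient httpclient = HttpClients.createDefault();
try {
    HttpGet httpGet = new HttpGet("http://www.fetagracollege.org");
    CloseableHttpResponse response = httpclient.execute(httpGet);
    try {
        // read the whole body before closing the response
        System.out.println(EntityUtils.toString(response.getEntity()));
    } finally {
        response.close();
    }
} finally {
    httpclient.close();
}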

Leos Literak

I use this code whenever I connect to a website through Java:

import java.io.*;
import java.net.*;

public class Connection
{
    public static void main(String[] args) throws Exception
    {
        new Connection();
    }
    public Connection() throws Exception
    {
        URL url = new URL("http://www.fetagracollege.org"); //The URL
        HttpURLConnection huc = connect(url); //Builds and configures the connection
        huc.connect(); //Opens the connection
        String str = readBody(huc); //Reads the response
        huc.disconnect(); //Closes
        System.out.println(str); //Prints all output to the console
    }

    private String readBody(HttpURLConnection huc) throws Exception //Reads the response
    {
        InputStream is = huc.getInputStream(); //Inputstream
        BufferedReader rd = new BufferedReader(new InputStreamReader(is)); //BufferedReader
        String line;
        StringBuffer response = new StringBuffer();
        while ((line = rd.readLine()) != null)
        {
            response.append(line); //Append the line
            response.append('\n'); //and a new line
        }
        rd.close();
        return response.toString();
    }

    private HttpURLConnection connect(URL url) throws Exception //Connect to the URL
    {
        HttpURLConnection huc = (HttpURLConnection) url.openConnection(); //Creates the connection object (no network I/O yet)
        huc.setReadTimeout(15000); //Read timeout - 15 seconds
        huc.setConnectTimeout(15000); //Connecting timeout - 15 seconds
        huc.setUseCaches(false); //Don't use cache
        HttpURLConnection.setFollowRedirects(true); //Follow redirects if there are any
        huc.addRequestProperty("Host", "www.fetagracollege.org"); //www.fetagracollege.org is the host
        huc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"); //Chrome user agent
        return huc;
    }
}

The website's response ended with this, so I think the problem is server-side, as other websites work with this code (tested with Twitter and Google):

                            </font>&copy; fetaca 2011 </td>
                    </tr>
            </table>
    <p><br>
Bobby-Z

For reading content from a URL, you can use jsoup and then write the content out using standard file handling (OutputStream out = ...). To read it with jsoup:

String url = "URL"; // the URL to read
Document doc = Jsoup.connect(url).get(); // fetch and parse the page into a Document
String content = doc.toString(); // the content as an HTML String

Now that you have the content in a String, you can easily flush it into a file.
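
For example, a minimal sketch of that writing step (the class name JsoupSave and the output file name are mine):

import java.io.FileWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSave {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.fetagracollege.org").get(); // fetch and parse the page
        FileWriter out = new FileWriter("page.html"); // hypothetical output file
        out.write(doc.toString()); // write the serialized HTML
        out.flush();
        out.close();
    }
}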

For this, you will need the jsoup jars and three imports: import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.select.Elements;

Soumya Sarkar