12

I just want to retrieve any web page's source code from Java. I have found lots of solutions so far, but I couldn't find any code that works for all the links below:

The main problem is that some code retrieves the page source, but with parts of it missing. For example, the code below does not work for the first link.

InputStream is = fURL.openStream(); // fURL can be one of the links above
StringBuilder builder = new StringBuilder();
BufferedReader buffer = new BufferedReader(new InputStreamReader(is, "iso-8859-9"));

int charRead; // Reader.read() returns one character (as an int), not a byte
while ((charRead = buffer.read()) != -1) {
    builder.append((char) charRead);
}
buffer.close();
System.out.println(builder.toString());
durron597
brtb
  • Note that you'll only get the source that is initially delivered when opening a URL. There might be additional content loaded via AJAX, and you won't see that content when you just read the initial stream. As an example, open http://demo.vaadin.com/sampler in Firefox and then view the page source. You won't see the source for all the displayed content there. – Thomas Dec 23 '11 at 13:51
  • @cerq: Depending on your definition of *"web page's source code"* you can or you cannot do it. For example, it can be argued that the "source code" of, say, a webpage generated by a *.jsp* is the *.jsp* file itself and **not** the generated HTML... What you're after is the HTML, not the "source code". In many cases the "source code" is on the server, and short of pirating the server you simply cannot access it. – TacticalCoder Dec 23 '11 at 13:53
  • @Thomas I think my problem is the one you describe. So is there any way to get the source of all the displayed content? – brtb Dec 23 '11 at 15:26
  • Well, you'd have to execute the JavaScript. Have a look at [ScriptEngineManager](http://docs.oracle.com/javase/7/docs/api/javax/script/ScriptEngineManager.html). – Thomas Dec 23 '11 at 19:52
  • I happen to be asking the exact same question. If you have found the answer, please post it here. Thanks! – Hendra Anggrian Jun 03 '14 at 18:55
  • Perhaps a duplicate of: [How do you Programmatically Download a Webpage in Java](http://stackoverflow.com/q/238547/642706). – Basil Bourque Jun 15 '14 at 01:29
  • People looking for a solution to this kind of problem can try the code below: – Ali Safari Feb 25 '20 at 21:56
  • URL pageURL = new URL("https://www.researchgate.net/"); BufferedReader in = new BufferedReader(new InputStreamReader(pageURL.openStream())); String fileName = "C:\\Users\\Ali\\Desktop\\test.html"; PrintWriter writer = new PrintWriter(fileName, "UTF-8"); String inputLine; while ((inputLine = in.readLine()) != null) { System.out.println(inputLine); writer.println(inputLine); } in.close(); writer.close(); – Ali Safari Feb 25 '20 at 21:57

3 Answers

26

Try the following code with an added request property:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class SocketConnection
{
    public static String getURLSource(String url) throws IOException
    {
        URL urlObject = new URL(url);
        URLConnection urlConnection = urlObject.openConnection();
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

        return toString(urlConnection.getInputStream());
    }

    private static String toString(InputStream inputStream) throws IOException
    {
        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")))
        {
            String inputLine;
            StringBuilder stringBuilder = new StringBuilder();
            while ((inputLine = bufferedReader.readLine()) != null)
            {
                // readLine() strips the line terminator, so restore it
                stringBuilder.append(inputLine).append('\n');
            }

            return stringBuilder.toString();
        }
    }
}
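The code above always decodes the response as UTF-8, while the question's snippet hard-codes iso-8859-9; either choice can garble a page served in a different encoding. One possible refinement, sketched here under assumptions, is to take the charset from the response's Content-Type header and fall back to a default when none is declared. The class name `CharsetSniffer` and the method `charsetFrom` are hypothetical names introduced for illustration; they are not part of the original answer.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class CharsetSniffer {

    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=ISO-8859-9", or the fallback when the header is
     * missing, has no charset parameter, or names an unsupported charset.
     */
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip the parameter name and any surrounding quotes.
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback;
    }
}
```

With this helper, the reader could be built as `new InputStreamReader(urlConnection.getInputStream(), CharsetSniffer.charsetFrom(urlConnection.getContentType(), StandardCharsets.UTF_8))`. Note that some pages declare their encoding only in an HTML meta tag, which this sketch does not inspect.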
BullyWiiPlaza
narek.gevorgyan
3
URL yahoo = new URL("http://www.yahoo.com/");
BufferedReader in = new BufferedReader(
            new InputStreamReader(
            yahoo.openStream()));

String inputLine;

while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);

in.close();
subodh
  • I don't want code that only works for yahoo.com or google.com. Please read my post again. – brtb Dec 23 '11 at 14:24
2

I am sure you have found a solution sometime over the past 2 years, but the following is a solution that works for the site you asked about:

package javasandbox;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

/**
*
* @author Ryan.Oglesby
*/
public class JavaSandbox {

private static String sURL;

/**
 * @param args the command line arguments
 */
public static void main(String[] args) throws MalformedURLException, IOException {
    sURL = "http://www.cumhuriyet.com.tr/?hn=298710";
    System.out.println(sURL);
    URL url = new URL(sURL);
    HttpURLConnection httpCon = (HttpURLConnection) url.openConnection();
    //set http request headers
            httpCon.addRequestProperty("Host", "www.cumhuriyet.com.tr");
            httpCon.addRequestProperty("Connection", "keep-alive");
            httpCon.addRequestProperty("Cache-Control", "max-age=0");
            httpCon.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            httpCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
            // Accept-Encoding is deliberately not set: the body is read below as plain
            // UTF-8 text, so a gzip-compressed response would come out garbled.
            httpCon.addRequestProperty("Accept-Language", "en-US,en;q=0.8");
            //httpCon.addRequestProperty("Cookie", "JSESSIONID=EC0F373FCC023CD3B8B9C1E2E2F7606C; lang=tr; __utma=169322547.1217782332.1386173665.1386173665.1386173665.1; __utmb=169322547.1.10.1386173665; __utmc=169322547; __utmz=169322547.1386173665.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/8616781/how-to-get-a-web-pages-source-code-from-java; __gads=ID=3ab4e50d8713e391:T=1386173664:S=ALNI_Mb8N_wW0xS_wRa68vhR0gTRl8MwFA; scrElm=body");
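            // Disable redirect following for this connection only
            // (the static setFollowRedirects would change it for every connection).
            httpCon.setInstanceFollowRedirects(false);
            // setDoOutput(true) is omitted: a GET request sends no body.
            httpCon.setUseCaches(true);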

            httpCon.setRequestMethod("GET");

            BufferedReader in = new BufferedReader(new InputStreamReader(httpCon.getInputStream(), "UTF-8"));
            String inputLine;
            StringBuilder a = new StringBuilder();
            while ((inputLine = in.readLine()) != null)
                a.append(inputLine).append('\n'); // readLine() strips the line terminator
            in.close();

            System.out.println(a.toString());

            httpCon.disconnect();
}
}
Roglesby
  • Help is never too late. But I tried your code and it doesn't work on many web pages. – Hendra Anggrian Jun 03 '14 at 18:51
  • I agree that this segment won't work against all web pages, as different pages return the data in different formats, and in some cases following redirects may be required for what you want to accomplish. In some cases you may receive a gzip-compressed response, which you could handle as follows: `InputStream gzippedResponse = httpCon.getInputStream(); InputStream ungzippedResponse = new GZIPInputStream(gzippedResponse); InputStreamReader reader = new InputStreamReader(ungzippedResponse, "UTF-8"); StringWriter writer = new StringWriter();` – Roglesby May 29 '15 at 18:41
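The gzip handling described in the comment above can be sketched as a small helper that unwraps the stream only when the response's Content-Encoding header says gzip. This is a minimal sketch, not the answerer's code: the class `GzipBodyReader` and its methods `decode`, `readBody`, and `gzip` are hypothetical names introduced for illustration, and the body is assumed to be UTF-8 text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipBodyReader {

    /** Wraps the stream in a GZIPInputStream only when Content-Encoding is gzip. */
    static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body;
    }

    /** Reads the whole (possibly gzipped) body as UTF-8 text (assumed encoding). */
    static String readBody(InputStream body, String contentEncoding) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(decode(body, contentEncoding), StandardCharsets.UTF_8)) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        }
        return text.toString();
    }

    /** Helper for trying the reader offline: gzips a string into a byte array. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }
}
```

Against a live connection this might be called as `GzipBodyReader.readBody(httpCon.getInputStream(), httpCon.getContentEncoding())`, so that the same call works whether or not the server compressed the response.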