27

I'm working on an application that needs to get the source of a web page from a link, and then parse the html from that page.

Could you give me some examples, or starting points where to look to start writing such an app?

halfer
Praveen
  • It is not totally clear what you want to do? I guess you want to get the web page and then parse the html? – Janusz Mar 11 '10 at 08:43
  • I am working on HTML parsing. As a first task, I want to get the HTML source from my link. How do I do that? Sorry for my poor English. Thanks for encouraging me. – Praveen Mar 11 '10 at 09:09
  • No problem, I tried to rephrase your question a bit. I hope it is still the same question :) For future questions: your question is very broad. We like questions that are a bit more specific and cover a single problem in your app, maybe with some example code to explain the problem... – Janusz Mar 11 '10 at 09:24

8 Answers

47

You can use HttpClient to perform an HTTP GET and retrieve the HTML response, something like this:

HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(url);
HttpResponse response = client.execute(request);

String html = "";
InputStream in = response.getEntity().getContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
StringBuilder str = new StringBuilder();
String line = null;
while((line = reader.readLine()) != null)
{
    str.append(line);
}
in.close();
html = str.toString();
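
Note that `readLine()` strips line terminators, so appending each line directly joins all lines together, and an `InputStreamReader` created without a charset decodes with the platform default. A minimal sketch of a helper that keeps line breaks, takes the charset explicitly, and closes the stream with try-with-resources is shown here. The names `HtmlStreams` and `readAll` are hypothetical, and passing UTF-8 in the usage line is an assumption about the page's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper; not part of HttpClient or the Android SDK.
final class HtmlStreams {
    private HtmlStreams() {}

    /** Reads the whole stream as text, writing one '\n' after every line. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream) even on error.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }
}
```

With the snippet above, this could be used as `String html = HtmlStreams.readAll(response.getEntity().getContent(), StandardCharsets.UTF_8);`.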
Mark B
  • 2
    Too bad, I am getting an unknown host exception, but I can open a browser to the same URL. – Rhyous Oct 24 '11 at 04:14
  • 9
    Got the unknown host exception too, for me it was a rights issue, added this ` ` to the manifest – Michel Jan 23 '12 at 10:01
  • Is there any way to read all content in one step, without reading line by line? – Mehmed Feb 24 '13 at 11:21
  • I'm getting a "NullReferenceException" with my url being `new URI("http://www.google.com/")`. Any permissions required other than "android.permission.INTERNET"? – Kamran Ahmed May 28 '13 at 06:32
  • 1
    Why not use `String html = EntityUtils.toString(response.getEntity());` – ben Jun 22 '13 at 14:53
  • How can I get just some lines of a web page's HTML source? Fetching the full page may take a long time, and I don't need every line. Can you help me? Thanks very much. – Milad gh Sep 17 '15 at 04:05
  • Please mention what we should use instead of HttpClient as well. – Utkarsh Sinha Feb 13 '17 at 19:52
  • See my answer below if you're looking for an answer that doesn't use the now deprecated HttpClient. – Colin White Mar 03 '17 at 06:41
17

I would suggest jsoup.

According to their website:

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the news section into a list of Elements (online sample):

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Getting started:

  1. Download the jsoup jar core library
  2. Read the cookbook introduction
halfer
Paul Spiesberger
14

This question is a bit old, but I figured I should post my answer now that DefaultHttpClient, HttpGet, etc. are deprecated. This function should get and return HTML, given a URL.

public static String getHtml(String url) throws IOException {
    // Build and set timeout values for the request.
    URLConnection connection = (new URL(url)).openConnection();
    connection.setConnectTimeout(5000);
    connection.setReadTimeout(5000);
    connection.connect();

    // Read and store the result line by line then return the entire string.
    InputStream in = connection.getInputStream();
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    StringBuilder html = new StringBuilder();
    for (String line; (line = reader.readLine()) != null; ) {
        html.append(line);
    }
    in.close();

    return html.toString();
}
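
One thing the function above leaves open is the character encoding: `new InputStreamReader(in)` uses the platform default. A rough sketch of picking the charset from the response's `Content-Type` header follows. The class `ContentTypes` and method `charsetOf` are hypothetical names, and falling back to UTF-8 when no charset is declared is an assumption, not something the server guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for choosing a charset from a Content-Type header value.
final class ContentTypes {
    private ContentTypes() {}

    /** Returns the declared charset, or UTF-8 if none is declared or it is unsupported. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            // Header looks like: text/html; charset=ISO-8859-1
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // malformed or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8; // assumed fallback
    }
}
```

In `getHtml`, the reader could then be created as `new InputStreamReader(in, ContentTypes.charsetOf(connection.getContentType()))`.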
Colin White
6
public class RetrieveSiteData extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... urls) {
        StringBuilder builder = new StringBuilder(100000);

        for (String url : urls) {
            DefaultHttpClient client = new DefaultHttpClient();
            HttpGet httpGet = new HttpGet(url);
            try {
                HttpResponse execute = client.execute(httpGet);
                InputStream content = execute.getEntity().getContent();

                BufferedReader buffer = new BufferedReader(new InputStreamReader(content));
                String s = "";
                while ((s = buffer.readLine()) != null) {
                    builder.append(s);
                }

            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        return builder.toString();
    }

    @Override
    protected void onPostExecute(String result) {

    }
}
Julian
1

Call it like this:

new RetrieveFeedTask(new OnTaskFinished()
        {
            @Override
            public void onFeedRetrieved(String feeds)
            {
                //do whatever you want to do with the feeds
            }
        }).execute("http://enterurlhere.com");

RetrieveFeedTask.java

class RetrieveFeedTask extends AsyncTask<String, Void, String>
{
    String HTML_response= "";

    OnTaskFinished onOurTaskFinished;


    public RetrieveFeedTask(OnTaskFinished onTaskFinished)
    {
        onOurTaskFinished = onTaskFinished;
    }
    @Override
    protected void onPreExecute()
    {
        super.onPreExecute();
    }

    @Override
    protected String doInBackground(String... urls)
    {
        try
        {
            URL url = new URL(urls[0]); // enter your url here which to download

            URLConnection conn = url.openConnection();

            // open the stream and put it into BufferedReader
            BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));

            String inputLine;

            while ((inputLine = br.readLine()) != null)
            {
                // System.out.println(inputLine);
                HTML_response += inputLine;
            }
            br.close();

            System.out.println("Done");

        }
        catch (MalformedURLException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        return HTML_response;
    }

    @Override
    protected void onPostExecute(String feed)
    {
        onOurTaskFinished.onFeedRetrieved(feed);
    }
}

OnTaskFinished.java

public interface OnTaskFinished
{
    public void onFeedRetrieved(String feeds);
}
Zar E Ahmer
0

If you have a look here or here, you will see that you can't do that directly with the Android API; you need an external library.

You can choose between the two linked above if you need an external library.

Sephy
  • 1
    That depends on the kind of web page you have and want to parse. If you are only looking for some specific values, you can grab those values with a regular expression :) I would only add a new external library if the use case is complicated enough to justify it. – Janusz Mar 11 '10 at 09:19
  • Fair enough. Regexes are quite easy to work with, but then you need to load the whole page and grab each tag you're interested in with a custom regex, don't you? – Sephy Mar 11 '10 at 10:51
  • before using regex we need to get the html source as a string. how to do that? – Praveen Mar 11 '10 at 12:54
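
To illustrate the regular-expression approach discussed in these comments, here is a minimal sketch that pulls the page title out of HTML that has already been downloaded into a `String` (by any of the approaches in the other answers). The class and method names are hypothetical. Regular expressions are brittle against real-world HTML, so this is only reasonable for grabbing a few well-known values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the text of the first <title> element.
final class TitleExtractor {
    // DOTALL lets '.' span line breaks; CASE_INSENSITIVE accepts <TITLE> as well.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private TitleExtractor() {}

    /** Returns the trimmed title text, or null if the page has no title element. */
    static String titleOf(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

For anything beyond a handful of values, a real HTML parser is likely the safer choice.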
0

An answer to another SO post helped me. This approach doesn't read line by line, which helps if the HTML file supposedly has a null line in between. As a prerequisite, add the dependency "com.koushikdutta.ion:ion:2.2.1" in your project settings, then implement this code in an AsyncTask. If you want the returned result to be handled on the UI thread, pass it through a shared interface.

Ion.with(getApplicationContext())
    .load("https://google.com/hashbrowns")
    .asString()
    .setCallback(new FutureCallback<String>() {
        @Override
        public void onCompleted(Exception e, String result) {
            // int s = result.lastIndexOf("user_id") + 9;
            // String st = result.substring(s, s + 5);
            // Log.e("USERID", st); // something
        }
    });
0
public class DownloadTask extends AsyncTask<String, Void, String> {

        @Override
        protected String doInBackground(String... urls) {

            String result = "";
            URL url;
            HttpsURLConnection urlConnection = null;

            try {
                url = new URL(urls[0]);

                urlConnection = (HttpsURLConnection) url.openConnection();

                BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));


                String inputLine;

                while ((inputLine = br.readLine()) != null)
                {
                    // System.out.println(inputLine);
                    result += inputLine;
                }
                br.close();
                return result;
            } catch (Exception e) {
                e.printStackTrace();
                return "failed";
            }
        }
    }

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        DownloadTask task = new DownloadTask();

        String result = null;

        try {
            result = task.execute("https://www.example.com").get();
        }catch (Exception e){

            e.printStackTrace();
        }
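        // result is still null if execute()/get() failed; String.valueOf avoids passing null to Log.i
        Log.i("Result", String.valueOf(result));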

    }
Kuruchy
  • 1
    Hi Ashique Hira Manzil, welcome to StackOverflow. I would suggest adding more than just code to an answer. Also take into account that the post is 10 years old, and that AsyncTask is deprecated by Android. – Kuruchy Apr 14 '20 at 17:25
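
Since AsyncTask is deprecated, as the comment above points out, one possible alternative is a plain `java.util.concurrent` executor. The sketch below runs a blocking fetch on a worker thread and hands the result to a callback through a second executor. All names (`BackgroundLoader`, `Callback`, `load`) are hypothetical, and the fetch itself is passed in as a `Callable<String>`, so any of the download functions from the answers could be plugged in.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an AsyncTask subclass; names are illustrative only.
final class BackgroundLoader {

    /** Receives the outcome on the callback executor. */
    interface Callback {
        void onLoaded(String html);
        void onFailed(Exception error);
    }

    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Executor callbackExecutor;

    BackgroundLoader(Executor callbackExecutor) {
        this.callbackExecutor = callbackExecutor;
    }

    /** Runs fetch on the worker thread, then reports through callbackExecutor. */
    void load(Callable<String> fetch, Callback callback) {
        worker.execute(() -> {
            String html;
            try {
                html = fetch.call(); // blocking network call happens off the caller's thread
            } catch (Exception e) {
                callbackExecutor.execute(() -> callback.onFailed(e));
                return;
            }
            String result = html;
            callbackExecutor.execute(() -> callback.onLoaded(result));
        });
    }

    /** Stops accepting new work; call when the owner is destroyed. */
    void shutdown() {
        worker.shutdown();
    }
}
```

On Android, the callback executor would presumably post to the main thread, for example `new Handler(Looper.getMainLooper())::post`, so that `onLoaded` can touch views. Unlike `task.execute(...).get()`, this does not block the calling thread while the download runs.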