1

I'm writing an Android app that takes relevant data from a website and presents it to the user (HTML scraping). The application downloads the page source and parses it, looking for relevant data to store in objects. I built a parser using JSoup, but it turned out to be really slow in my app. Also, these libraries tend to be rather large, and I want my app to be lightweight.

The webpages I'm trying to parse all have a similar structure, and I know exactly which tags I'm looking for. So I figured I might as well download the source and read it line by line, looking for the relevant data with String methods such as startsWith and substring. For example, if the HTML looked like this:

<textTag class="text">I want this text</textTag>

I would parse it using methods like:

private void interpretHtml(String s){
    // Opening tag is 22 characters long, closing tag </textTag> is 10.
    if(s.startsWith("<textTag class=\"text\"")){
        String text = s.substring(22, s.length() - 10);
    }
}
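The fixed offsets above (22 and 10) only work while the tag is written exactly as shown and the line has no leading whitespace. A slightly more tolerant variant, offered as a minimal sketch and not as a replacement for a real HTML parser, locates the tag boundaries with indexOf. The class name TagTextExtractor and the method extract are hypothetical names chosen for illustration.

```java
class TagTextExtractor {

    // Returns the text between the first opening tag that begins with
    // openPrefix and the next occurrence of closeTag, or null if the
    // line does not contain both.
    static String extract(String line, String openPrefix, String closeTag) {
        int open = line.indexOf(openPrefix);
        if (open < 0) {
            return null;
        }
        // The end of the opening tag is the first '>' after the prefix.
        int start = line.indexOf('>', open + openPrefix.length());
        if (start < 0) {
            return null;
        }
        int end = line.indexOf(closeTag, start + 1);
        if (end < 0) {
            return null;
        }
        return line.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String line = "  <textTag class=\"text\">I want this text</textTag>";
        System.out.println(extract(line, "<textTag class=\"text\"", "</textTag>"));
    }
}
```

This still assumes the opening and closing tags sit on the same line, which is the same assumption the question makes.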

However, I have very little knowledge about setting up connections (I've seen people use HttpGet, but I'm not entirely sure how to get data out of it). I've searched for quite some time for information on how to parse like this, but most people resort to libraries like JSoup, SAX, etc.

Does anyone happen to have some information on how to do parsing like this, maybe an example? Or is it a bad idea to parse source code in this way? Please give me your opinion.

Thank you for your time.

Bhesh Gurung
Xae
  • Check out http://stackoverflow.com/a/1732454/894284 . BTW, SO generally frowns on subjective questions -- see the faq http://stackoverflow.com/faq#dontask. – Matt Fenwick Dec 12 '11 at 20:11
  • Why don't you use http://jsoup.org/? – coder_For_Life22 Dec 12 '11 at 20:18
  • @MattFenwick Thank you for your comment. Looking back on my question I can see how it can be considered subjective, but I can imagine a lot of people are asking themselves this question. Also, I have actually seen the post you're referring to. However, I also noticed the post below it, saying that "it's sometimes appropriate to parse a limited, known set of HTML". – Xae Dec 12 '11 at 20:22

3 Answers

1

To get a webpage in Java, you'll find code at the bottom of this answer.

To pull the data out of the page, you can use regular expressions.
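As a rough sketch of the regular-expression route, the snippet below captures the text inside the tag from the question using java.util.regex, which is available on Android. The class name RegexTagText and the method firstMatch are made up for this example, and the pattern assumes the tag is not nested inside another textTag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTagText {

    // Matches <textTag class="text" ...>inner text</textTag> and captures
    // the inner text lazily, so it stops at the first closing tag.
    private static final Pattern TEXT_TAG = Pattern.compile(
            "<textTag\\s+class=\"text\"[^>]*>(.*?)</textTag>",
            Pattern.DOTALL);

    // Returns the text of the first matching tag, or null if there is none.
    static String firstMatch(String html) {
        Matcher m = TEXT_TAG.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p>x</p><textTag class=\"text\">I want this text</textTag>";
        System.out.println(firstMatch(html));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every page, which matters a little on a phone.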

Here's a nice reference:

android regex

But if the HTML is well formed, you can also try Yahoo's YQL. It outputs JSON or XML, so you can easily grab the data afterwards.

yahoo yql console

Personally, I parse pages in Python or PHP because I feel more comfortable in those languages.

Get webpage. How to use it:

Get_Webpage obj = new Get_Webpage("http://your_url_here");
String source = obj.get_webpage_source();


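import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class Get_Webpage {

    public String parsing_url = "";

    public Get_Webpage(String url_2_get){
        parsing_url = url_2_get;
    }

    // Downloads the page and returns its source.
    // If the request fails, whatever was read so far (possibly "") is returned.
    public String get_webpage_source(){

        HttpClient client = new DefaultHttpClient();
        HttpGet request = new HttpGet(parsing_url);
        StringBuilder str = new StringBuilder();
        BufferedReader reader = null;
        try {
            HttpResponse response = client.execute(request);
            InputStream in = response.getEntity().getContent();
            reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while((line = reader.readLine()) != null)
            {
                // readLine() strips the line break, so put it back
                // to keep the lines separate for later parsing.
                str.append(line).append('\n');
            }
        } catch (IOException e) {
            // Covers ClientProtocolException as well; nothing more to read.
        } catch (IllegalStateException e) {
            // The response content could not be opened.
        } finally {
            if (reader != null) {
                try {
                    reader.close(); // also closes the underlying stream
                } catch (IOException e) {
                    // ignore
                }
            }
        }

        return str.toString();
    }

}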
OWADVL
  • Thanks a lot for your reply! It seems to be exactly what I need. I do have another question about your code though. Would it be a good idea to actually start parsing in the while loop, so that I can parse line by line, instead of writing the whole HTML source to a string? – Xae Dec 12 '11 at 21:25
  • That 0.000001-second difference isn't worth complicating the code. Write each function separately so you can keep a clear head about the whole program. – OWADVL Dec 13 '11 at 13:17
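For anyone who still wants to parse inside the read loop, it can be kept in its own function, which fits the advice above. The following is only a sketch under the question's assumption that each tag sits on a single line. LineByLineParser, parse and parseString are hypothetical names, and a StringReader stands in for the network stream so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LineByLineParser {

    static final String OPEN = "<textTag class=\"text\">";
    static final String CLOSE = "</textTag>";

    // Reads the source one line at a time and collects the text of every
    // line that consists of exactly one <textTag class="text"> element.
    static List<String> parse(Reader source) throws IOException {
        List<String> results = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim(); // tolerate indentation
            if (trimmed.startsWith(OPEN) && trimmed.endsWith(CLOSE)) {
                results.add(trimmed.substring(
                        OPEN.length(), trimmed.length() - CLOSE.length()));
            }
        }
        return results;
    }

    // Convenience wrapper for in-memory HTML; a StringReader does not
    // fail while reading, so the IOException is not expected here.
    static List<String> parseString(String html) {
        try {
            return parse(new StringReader(html));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html>\n  <textTag class=\"text\">first</textTag>\n"
                + "<p>skip</p>\n<textTag class=\"text\">second</textTag>\n";
        System.out.println(parseString(html));
    }
}
```

In the app, the Reader passed to parse would be the InputStreamReader wrapped around the response stream, so the page never has to be held in memory as one big string.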
1

Here is how I would do it:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import android.util.Log;

    // The original snippet had no method signature; this name is a placeholder.
    private String fetchAndInterpret() {
        StringBuilder text = new StringBuilder();
        HttpURLConnection conn = null;
        BufferedReader buff = null;
        try {
            URL page = new URL(
                    "http://example.com/");
            // When passing parameters that may contain symbols or spaces,
            // encode them with URLEncoder.encode(value, "UTF-8"); otherwise
            // the request may fail (for example with a 404).
            conn = (HttpURLConnection) page.openConnection();
            conn.connect();
            /* use this if you need to check the response code
            int responseCode = conn.getResponseCode();

            if (responseCode == 401 || responseCode == 403) {
                // Authorization Error
                Log.e(Standards.tag, "Authorization Error");
                throw new Exception("Authorization Error");
            }

            if (responseCode >= 500 && responseCode <= 504) {
                // Server Error
                Log.e(Standards.tag, "Internal Server Error");
                throw new Exception("Internal Server Error");
            }*/
            buff = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            // Stop at end of stream instead of passing null to interpretHtml.
            while ((line = buff.readLine()) != null) {
                // Remove the next three lines if you need to load the whole HTML document.
                String found = interpretHtml(line);
                if (null != found)
                    return found;
                text.append(line).append("\n");
            }
        } catch (Exception e) {
            // Standards.tag is the author's own log tag constant.
            Log.e(Standards.tag,
                    "Exception while getting html from website, exception: "
                            + e.toString() + ", cause: " + e.getCause()
                            + ", message: " + e.getMessage());
        } finally {
            if (null != buff) {
                try {
                    buff.close(); // also closes the wrapped InputStreamReader
                } catch (IOException e1) {
                }
            }
            if (null != conn) {
                conn.disconnect();
            }
        }
        // Only relevant when the early return above is removed and the whole page is loaded.
        if (text.length() > 0) {
            return interpretHtml(text.toString());
        } else return null;
    }

private String interpretHtml(String s){
    if(s != null && s.startsWith("<textTag class=\"text\"")){
        // 22 = length of the opening tag, 10 = length of </textTag>
        return s.substring(22, s.length() - 10);
    }
    return null;
}
Shereef Marzouk
0

I would say it's probably a bad idea to parse HTML on the device if you're experiencing performance issues. Have you considered creating a web app that your device app fetches data from?

If the data is from one source (i.e., one webpage and not many), I would build a web app to prefetch the site, parse it for relevant data, and cache the result for later use on the device(s).

John Giotta