Is HTML parsing (in Java/Android) then extracting data from it, an effective way of getting a webpage's content?

Question

I'm using HTTP POST requests in Java on Android to log into a website and then download the page's entire HTML. After that, I use Pattern/Matcher (regex) to find the elements I need, extract them from the HTML, and strip out everything unnecessary. For instance, when I extract this:

String extractions = "<td>Good day sir</td>";

Then I use:

// Strings are immutable, so the result has to be assigned back.
extractions = extractions.replaceAll("<td>", "").replaceAll("</td>", "");

I do this multiple times until I have all the data needed from that site, before I display it in some kind of list.

I'm not particularly stuck on anything, but can you tell me whether this is an effective, efficient, and fast way of getting data from a page and processing it, or are there faster ways to do this? Sometimes my program takes a long time to get certain data (although that's mostly when I'm on 3G on my phone).

Use jsoup for parsing HTML. As for the data: if it's your website, build a web service for getting the data. If not, the site may have an API (like Facebook, Twitter, etc.). If it doesn't, you can build your own API for the site: a web service that processes the website's content, so that the Android app only accesses the processed data. - Selvin, Apr 04 '12 at 08:57

score 0 · Answer 1 · edited May 23 '17 at 12:11

Using regex to parse a website is always a bad idea:

How to use regular expressions to parse HTML in Java?

Using regular expressions to parse HTML: why not?


answered Apr 04 '12 at 08:58

Christian Kuetbach


score 0 · Accepted Answer · answered Apr 04 '12 at 09:29

Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be.
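The answer does not spell out where the extra cost comes from. One plausible reading is that each `replaceAll` call compiles its pattern and scans the string again, on top of the original match. A minimal sketch of a single-pass alternative, assuming the cells are simple, non-nested `<td>` elements: a capture group returns the text between the tags directly, so no tag stripping is needed afterwards. The names `TdExtractor` and `extractCells` are hypothetical, and this is still regex, so it remains fragile against real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the asker's code.
class TdExtractor {
    // Compiled once and reused. DOTALL lets a cell span several lines;
    // [^>]* tolerates attributes on the opening tag.
    private static final Pattern TD_CELL =
            Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text content of every <td>...</td> found in one pass.
    static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TD_CELL.matcher(html);
        while (m.find()) {
            // group(1) is the text between the tags, so no replaceAll is needed.
            cells.add(m.group(1).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr><td>Good day sir</td><td class=\"x\">Second cell</td></tr>";
        System.out.println(extractCells(html)); // prints [Good day sir, Second cell]
    }
}
```

This assumes cells never contain nested tables; nested or malformed markup is exactly where a real HTML parser does better.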

In any case, let me offer one more possible solution (depending on your use case).

It's called YQL (Yahoo Query Language). http://developer.yahoo.com/yql/

Here is a console for it so you can play around with it. http://developer.yahoo.com/yql/console/

YQL is the lazy developer's way to build your own API on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're OK with that, I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the HTML you're targeting keeps changing and its tags are not always valid).

score 0 · Answer 3 · answered Apr 04 '12 at 09:34

Have a look at the Apache Tika library for extracting text from HTML. It also provides parsers for many other formats, such as PDF: http://tika.apache.org/

Clive van Hilten

