Android: Extracting info from a website but not via its source code

Question

I know how to extract the contents of a website by fetching its source code with URLConnection and BufferedReader, but sometimes a website itself gets data from elsewhere and then shows it on the page.

e.g. I am now working on this page http://bet.hkjc.com/marksix/userinfo.aspx?file=lucky_ocbs.asp&lang=en

The names of the 10 branches and the other details shown in the table on that page are not in the page's source code.

Question:

Instead of extracting data from the source code, is there any way to extract the text from what is finally displayed on the page? If yes, how can it be done?

Thanks a lot.

score 2 · Answer 1 · edited May 23 '17 at 12:01

Yes, there is a way to extract the information even if the website performs client-side operations, such as loading data from an external source before displaying it. It is a tricky solution, though. If you have the opportunity to reach an agreement with the website's owner and ask for an API for your application, I'd choose that option.

According to your question, you can try using Android's WebView to render the website first, then get the HTML content using one of the methods described here. The trickiest part is doing this in a user-friendly way: you have to cover the WebView with a progress bar while your app waits for the onPageFinished callback from the WebView. I'm not sure the WebView behaves properly in that case, but it's worth a try.
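One detail that tends to come up with this approach: if the rendered HTML is read inside onPageFinished by calling WebView.evaluateJavascript (API 19+) with a script such as document.documentElement.outerHTML, the callback receives the result as a JSON-encoded string, typically wrapped in quotes and with characters like < escaped as \u003C. Below is a minimal sketch of a decoder for that value. The class and method names are hypothetical, and it assumes the input is a well-formed JSON string.

```java
final class WebViewHtmlDecoder {

    // Converts the JSON-quoted string handed to the evaluateJavascript
    // callback back into plain text. Anything that is not a quoted string
    // (for example "null") is returned unchanged.
    static String decode(String jsResult) {
        if (jsResult == null || jsResult.length() < 2 || jsResult.charAt(0) != '"') {
            return jsResult;
        }
        StringBuilder out = new StringBuilder(jsResult.length());
        int end = jsResult.length() - 1; // index of the closing quote
        for (int i = 1; i < end; i++) {
            char c = jsResult.charAt(i);
            if (c != '\\' || i + 1 >= end) {
                out.append(c);
                continue;
            }
            char next = jsResult.charAt(++i);
            switch (next) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    if (i + 4 < end) {
                        // Four hex digits follow, e.g. \u003C for '<'.
                        out.append((char) Integer.parseInt(jsResult.substring(i + 1, i + 5), 16));
                        i += 4;
                    } else {
                        out.append(next);
                    }
                    break;
                default:
                    out.append(next); // covers \" \\ and \/
            }
        }
        return out.toString();
    }
}
```

The decoded string can then be handed to an HTML parser. Note that onPageFinished may fire before the page's own scripts have finished loading their data, so the table may still be missing at that point.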

score 0 · Answer 2 · answered Oct 28 '16 at 22:49

Short Answer: You can't.

Reason: The HTML is rendered on the client side, by a browser such as Chrome, Firefox, or Internet Explorer. Without an interpreter for the markup language, you cannot fetch only the content of a tag. Even browsers download the whole page; that is how HTTP works.

Workaround: Since you mentioned that some of the branch details are not in the page source, I assume they are loaded on the client side by some JavaScript. What you can do is check which requests the client makes and perform the same requests from your code, since in your case the client is the app.
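A minimal sketch of that workaround, assuming you have already found the address the page's JavaScript requests (for example in the network tab of a desktop browser's developer tools). The class name, the method names, and the dataUrl parameter are hypothetical; the real endpoint for this page is not known here, and the UTF-8 decoding is an assumption as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

final class DataEndpointFetcher {

    // Reads a whole stream into a String. UTF-8 is an assumption;
    // use whatever charset the endpoint actually declares.
    static String readBody(InputStream in) {
        try (Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    // Requests the same URL that the page's JavaScript calls and returns
    // the raw response body. On Android this must run off the main thread.
    static String fetch(String dataUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(dataUrl).openConnection();
        try {
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            try (InputStream in = conn.getInputStream()) {
                return readBody(in);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

The response may be HTML, JSON, or some other format, so how you parse it afterwards depends on what the endpoint actually returns.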

Also see: Jsoup

Anderson Oki

So even after the page has finished loading and the text is already showing, I cannot extract the text from the webpage even though it is being displayed? – pearmak Oct 29 '16 at 05:05
What does a "regex" to "remove the HTML tags" mean? – pearmak Oct 29 '16 at 05:19

score 0 · Answer 3 · answered Nov 04 '16 at 18:19

You cannot extract only the information you want without downloading the source HTML. After you have downloaded the source, you can use jsoup to navigate to just the information you want.

Add this to your app-level build.gradle file:

compile 'org.jsoup:jsoup:1.9.2'

Then you can download and parse the source code:

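import java.io.InputStream;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String url = "http://bet.hkjc.com/marksix/userinfo.aspx?file=lucky_ocbs.asp&lang=en";
InputStream input = new URL(url).openStream(); // network call: run off the main thread
Document doc = Jsoup.parse(input, "ISO-8859-9", url); // use the charset your page declares

Elements sectionElements = doc.select("div#general-info-panel"); // example selector, not from this page
Elements imageElements = sectionElements.select("img[src]"); // images inside that section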

You need to adapt the selectors in the code block above to the HTML source of your page. You can find examples of how to use jsoup.

score -1 · Answer 4 · answered Oct 28 '16 at 18:34

http://phantomjs.org/ can be used to extract a website's content after JavaScript execution. I am not sure whether there is an Android build.

David
