0

I need to parse data from a website that uses jQuery to generate a table from its database, and the owners don't want to provide the data to me in any other form (plain HTML, XML, etc.). In my previous experience, I could easily parse the data directly from the HTML file because the data was inside the HTML file itself. In this case, however, the table seems to be generated in browser memory, and if I try to parse the HTML, the only thing I get is the JavaScript (jQuery) itself; there are no <TABLE>, <TD>, or <TR> tags inside.

The question is: is there a way to get that table as plain HTML? (I'm expecting a solution for Android/Java, but other languages and platforms are welcome too.)

EDIT: For those who want to see an example of the data: I can't give the real data, but the following example is exactly the kind of data I need to parse:

http://datatables.net/examples/data_sources/server_side.html

The table is there, but if you open the HTML source you cannot find the data. Somehow it is generated in the browser's memory after the HTML is loaded, using data fetched from the server.

As requested by Saranya Sadhasivam, below is an example of the data output:

aaData: [[916, LATE, 14:38, SUCCESS, null], [532, EARLY, 14:42, SUCCESS, null],…]

iTotalDisplayRecords: 15

iTotalRecords: 15

oa00f43afb3246649816c727d67db0df9476346d5: "QBUSRAQOQQEWVw8SWlIEURZNRVwMTkEUSBUQCxAGXB9EV04SQVsYSF9AChBaUxFbH3NhK0oDBVQDXgZWWgUGOjljNWY0NGVj"

sEcho: 1

BOUNTY TERMS AND CONDITIONS:

The bounty goes to the first person who can parse the table data from the following link without accessing the server-side data:

http://datatables.net/examples/data_sources/server_side.html

On Android only.

DoogyHtw
  • Can you post the code you have tried? – Saranya Sadhasivam Aug 30 '13 at 05:43
  • You can also use jQuery/JavaScript to create the table and rows/cells and populate it with the data you're getting. Can you please show the data? – Jez D Aug 30 '13 at 05:50
  • Hi Saranya/Jez D, I cannot give you the real site, but if you want an example, this one is exactly what I need to parse: http://datatables.net/examples/data_sources/server_side.html . Remember that I cannot access the server side, so any server-side solution will not work; I need to parse the page directly. – DoogyHtw Aug 30 '13 at 05:51
  • @FerryHtw, What is the return data from the server side? – Saranya Sadhasivam Aug 30 '13 at 06:32
  • Hi Saranya, I don't know; I cannot access the server side, but the data appears in the browser. If you open this link http://datatables.net/examples/data_sources/server_side.html , this is exactly the kind of data I need to parse. I need to be able to parse the data (Gecko, Firefox, etc.) in the table shown there. The problem is that I cannot parse it because there are no <TABLE>, <TD>, or <TR> tags in the source for the table. – DoogyHtw Aug 30 '13 at 06:41
  • @FerryHtw, Is the data returned in JSON or XML or HTML? Without knowing the return type, we cannot proceed any further here. – Saranya Sadhasivam Aug 30 '13 at 06:55
  • @Saranya: I'm sorry, I have limited knowledge of JavaScript/jQuery. Following a suggestion I found via Google, I tried looking at it with the Chrome debugger, and I think it's JSON. It looks like this (correct me if I'm wrong): aaData: [[916, LATE, 14:38, SUCCESS, null], [532, EARLY, 14:42, SUCCESS, null],…] iTotalDisplayRecords: 15 iTotalRecords: 15 oa00f43afb3246649816c727d67db0df9476346d5: "QBUSRAQOQQEWVw8SWlIEURZNRVwMTkEUSBUQCxAGXB9EV04SQVsYSF9AChBaUxFbH3NhK0oDBVQDXgZWWgUGOjljNWY0NGVj" sEcho: 1 (I will also put it in the question above so it is easier to read.) – DoogyHtw Aug 30 '13 at 07:36
  • You are not getting any data from the server side. It shows only the total records as 15. There is no data. If it is JSON, see this link http://api.jquery.com/jQuery.parseJSON/ to parse JSON. – Saranya Sadhasivam Aug 30 '13 at 07:40
  • @Saranya: the data I really want is already there: [[916, LATE, 14:38, SUCCESS, null], [532, EARLY, 14:42, SUCCESS, null] etc., but I don't know how to get hold of it. Once I have it, I think I can parse it easily. The goal is to parse it and store it in an SQLite database (on Android). – DoogyHtw Aug 30 '13 at 07:42
  • @FerryHtw Are you loading this page in a WebView in Android? Another question: is making a request to the server-side URL prohibited? – Raúl Juárez Sep 02 '13 at 04:58
  • @Raul Making a request is prohibited. You are only able to open the URL and somehow read the table. – DoogyHtw Sep 02 '13 at 10:22
  • Not sure how you are able to get the HTML without querying the server, but let's assume you have this page loaded up in a browser somewhere already. You can ask jQuery to give you the HTML: `$('<div>').append($('#example')).html()`. – John Tseng Sep 03 '13 at 00:05
  • If you want only the data (which comes via AJAX), you can call http://datatables.net/examples/examples_support/server_processing.php directly. For the full link, see the AJAX requests in your browser console. – pszaba Sep 03 '13 at 11:55

4 Answers

5

Your goal is misguided, because you make a false assertion in your question:

"and they don't want to provide the data to me in any other form"

That is not true, based on your example from this page. If the real data behaves as you say:

"in this case the table seems to be generated in browser memory, and if I try to parse the HTML, the only thing I get is the JavaScript (jQuery) itself; there are no <TABLE>, <TD>, or <TR> tags inside"

then the site is evidently using AJAX to query the data as JSON and then generating the data table from it. This means that the data IS provided in another way: JSON. Now your question becomes not "How can I parse an HTML table generated by jQuery?" but rather "How can I parse JSON in Android?", in which case this question holds your answer.

I realize that this answer doesn't solve the question as asked, but it really is the correct way to do it. You don't want to parse complex tables generated from a jQuery plugin (which could easily change) if the data is already available in a standard data format (JSON).
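As a rough illustration of reading the JSON directly, here is a minimal sketch in JavaScript. The response body is a hypothetical stand-in shaped like the sample in the question, and the column names in `COLUMNS` are assumptions, since the question does not say what each position in a row means. The same steps (parse the text, read `aaData`, map each row to named fields) carry over to whatever JSON library is used on Android.

```javascript
// Hypothetical response body, shaped like the sample in the question.
// The real endpoint's values and extra fields may differ.
const responseText = JSON.stringify({
  sEcho: 1,
  iTotalRecords: 15,
  iTotalDisplayRecords: 15,
  aaData: [
    [916, "LATE", "14:38", "SUCCESS", null],
    [532, "EARLY", "14:42", "SUCCESS", null]
  ]
});

// Assumed column names; the question does not document the row layout.
const COLUMNS = ["id", "timing", "time", "status", "note"];

// Parse the response text and turn each positional row into a named record.
function parseRows(text) {
  const payload = JSON.parse(text);
  if (!Array.isArray(payload.aaData)) {
    throw new Error("response has no aaData array");
  }
  return payload.aaData.map(function (row) {
    const record = {};
    COLUMNS.forEach(function (name, i) {
      // Missing trailing cells are stored as null.
      record[name] = row[i] === undefined ? null : row[i];
    });
    return record;
  });
}

const rows = parseRows(responseText);
```

Each record in `rows` could then be written to a database row, which is the stated goal in the comments.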

Edit: I'm not concerned about earning the bounty, since I didn't answer within the exact parameters defined by the bounty conditions. But I really think you're making the problem harder than it is, and putting unnecessary constraints on yourself by saying you can only parse the HTML page and not the JSON output from the endpoint that the HTML page itself uses.

Edit 2: (From my comment on the asker's answer) Here's a metaphor of the situation. You need some wood to build a shed. You decide to hire a contractor to build you a house, then decide to take the house apart in order to get to the wood to build your shed. You ask "how can I best take apart the house to get the wood?" to which I respond "Don't. Go to the store and buy the wood directly."

xdumaine
  • You answered it correctly. The asker is the one that is either mistaken, or is leaving out important information. – David Bradbury Sep 03 '13 at 20:01
  • After some research and thought, I found the right search keyword for what I want, and I think I have found a way; it's doable. The keyword is "GUI-less browser" or "headless browser", and that is exactly what I need. I don't have to think about the server side, JSON, or the response, just "if the browser can load it, then scrape the table from it". Some examples (though not ideal ones) are at this link: http://stackoverflow.com/questions/17399055/android-web-scraping-with-a-headless-browser . I will update the answer when I get it working. – DoogyHtw Sep 03 '13 at 23:48
  • @FerryHtw You missed the point of my answer. JSON data output is DESIGNED to do what you're trying to do. Instead of loading the web page, you load the JSON endpoint (this is done over HTTP(s) just the same as HTML). You're overcomplicating the problem. – xdumaine Sep 04 '13 at 11:46
  • @xdumaine: The reason I can't go through JSON is that the JSON URL I should send the request to changes every time I load the page. The previous web developer made sure that nobody can access the server-side data and that nobody else can develop the website. That's why I mention in the T&C that I can't access server-side data. I know it's hard; that's why I offered this bounty. But now I know it's possible. BTW, sorry if you misunderstood what I wrote above; English is not my everyday language, but I try hard to do my best. – DoogyHtw Sep 05 '13 at 12:06
  • @FerryHtw Accessing JSON is no more "accessing server side data" than accessing HTML. If the JSON url changes, just scrape *that* and do the AJAX call. Are you saying that the previous developer has added some kind of token based authentication to the AJAX call? That's so ridiculous. – xdumaine Sep 05 '13 at 12:22
  • This is really the best approach to getting structured data from a page where data is fetched via JavaScript. The thing is, this is not easy; it's just a lot easier than faking a browser and extracting HTML. There is nothing a developer can do over HTTP that can't be faked with a library like HttpClient (Apache Commons). You just need to pay attention to the request headers that your browser sends when you fetch the page. If there is a cookie, then you will need to work out how to fetch that cookie. – Jason Sperske Sep 08 '13 at 17:19
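The "scrape the URL, then make the AJAX call" idea from the comments could look roughly like the sketch below. It assumes the page initialises DataTables with the legacy `sAjaxSource` option and that the endpoint appears as a string literal in the page source; the `pageSource` value is a hypothetical sample, not the real page, and `findAjaxSource` is an illustrative helper name.

```javascript
// Hypothetical page source; a real page may format its initialisation differently.
const pageSource = [
  "<script>",
  "$(document).ready(function() {",
  "  $('#example').dataTable({",
  "    \"bServerSide\": true,",
  "    \"sAjaxSource\": \"../examples_support/server_processing.php\"",
  "  });",
  "});",
  "</script>"
].join("\n");

// Find the sAjaxSource value in the page source and resolve it against the
// page's own URL. Returns null when no such option is found.
function findAjaxSource(html, pageUrl) {
  const match = /["']?sAjaxSource["']?\s*:\s*["']([^"']+)["']/.exec(html);
  if (!match) {
    return null;
  }
  return new URL(match[1], pageUrl).href;
}

const endpoint = findAjaxSource(
  pageSource,
  "http://datatables.net/examples/data_sources/server_side.html"
);
```

If the endpoint changes on every page load, this lookup would have to be repeated for each freshly loaded page, carrying over any cookies or headers from that load, before requesting the JSON.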
0

I think you will have some trouble parsing this on Android, but you can use a server to do the parsing and have it send the data to the Android app to handle. To do that, you can use Mechanize with a Firefox extension to handle the JavaScript. You need that because Mechanize alone can't handle JS; only browsers can. The data in the table is generated after the page's onLoad event (so you need to handle JS, and that is why you can't parse the HTML directly).

There is a Mechanize for Java too.

There are other options as well: this post shows real web browsers that can handle JS. I have never used those options, but you can try them.

Scoup
0

If the data is in a jQuery DataTables object, as in the example, you should use $("#example").DataTable().fnGetData(). The data is not visible as HTML in source-code because it is generated dynamically as you point out above. There may be some form of the data present in source code, perhaps JSON in a hidden input, or it may be in an external file or fetched via AJAX, but there is nothing wrong with accessing it after it has been parsed for you by DataTables.

Obviously you just need to use the id of the DataTable instance as your selector in the first term. Running the line above for the example returns the data in the following format:

[["Gecko", "Firefox 1.0", "Win 98+ / OSX.2+", 2 more...], ["Gecko", "Firefox 1.5", "Win 98+ / OSX.2+", 2 more...], ["Gecko", "Firefox 2.0", "Win 98+ / OSX.2+", 2 more...], ["Gecko", "Firefox 3.0", "Win 2k+ / OSX.3+", 2 more...], ["Gecko", "Camino 1.0", "OSX.2+", 2 more...], ["Gecko", "Camino 1.5", "OSX.3+", 2 more...], ["Gecko", "Netscape 7.2", "Win 95+ / Mac OS 8.6-9.2", 2 more...], ["Gecko", "Netscape Browser 8", "Win 98SE+", 2 more...], ["Gecko", "Netscape Navigator 9", "Win 98+ / OSX.2+", 2 more...], ["Gecko", "Mozilla 1.0", "Win 95+ / OSX.1+", 2 more...]]

If the data is fetched via AJAX, and it is paginated, this method is no longer ideal. But if you truly need a front-end-only solution, as you're suggesting, you could still use this general method with a few twists.
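Since the question asks for the table as plain HTML, an array of rows like the one above can be turned into markup with a small helper. This is a minimal sketch: `escapeHtml` and `toHtmlTable` are illustrative names rather than part of jQuery or DataTables, and the sample rows are cut down to the three columns visible in the output above.

```javascript
// Escape characters that would otherwise be read as markup.
function escapeHtml(value) {
  return String(value === null || value === undefined ? "" : value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a plain <table> string from an array of row arrays.
function toHtmlTable(rows) {
  const body = rows.map(function (row) {
    const cells = row.map(function (cell) {
      return "<td>" + escapeHtml(cell) + "</td>";
    }).join("");
    return "<tr>" + cells + "</tr>";
  }).join("");
  return "<table>" + body + "</table>";
}

// Sample rows taken from the output above, limited to the visible columns.
const sample = [
  ["Gecko", "Firefox 1.0", "Win 98+ / OSX.2+"],
  ["Gecko", "Camino 1.0", "OSX.2+"]
];
const html = toHtmlTable(sample);
```

The resulting string contains ordinary `<tr>` and `<td>` tags, so it can be handed to any HTML parser that expects a static table.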

cage rattler
-1

After some research and thought, I found the right search keyword for what I want, and I think I have found a way; it's doable. The keyword is "GUI-less browser" or "headless browser", and that is exactly what I need. I don't have to think about the server-side data, JSON, or the response, just "if the browser can load the page and run the JavaScript, and you can see the table in it, then scrape the table from it". Some examples (though not ideal ones) are at this link:

Android Web Scraping with a Headless Browser

I will update this answer once I have confirmed that it works and which method I used.

DoogyHtw
  • This is NOT the way to do this. The web page does an ajax call for JSON data. You want to load that same exact data. Why scrape the HTML when you can read the JSON result directly? This is like saying "I'd like to hire a contractor to build a house, then I'd like to take apart the house to use the wood." To which I'd respond, "why don't you just buy the wood from the store yourself?" – xdumaine Sep 04 '13 at 11:49