
I'm trying to grab a table from the following webpage:

http://www.bloomberg.com/markets/companies/country/hong-kong/

I have some sample code which was kindly provided by Phil Bozak here:

grabbing table from html using Google script

which grabs the table for this website:

http://www.airchina.com.cn/www/en/html/index/ir/traffic/

As you can see from Phil's code, there are a lot of getElement() calls. Looking at the HTML for the Air China page, the table appears to be nested four levels deep. Is that why there is a chain of .getElement() calls?

Now I look at the source code for the Bloomberg page, and it is loaded with "div" elements...

The question is: can someone show me how to grab the table from the Bloomberg page?

A brief explanation of the theory would also be useful. Thanks a bunch.

Rubén
jason

1 Answer


Let's flip your question upside down, and start with the theory. Methodology might be a better word for it.

You want to get at something specific in a structured page. To do that, you either need a way to zap right to the element (which can be done if it's labeled in a unique way that we can access), OR you need to navigate the structure more-or-less manually. You already know how to look at the source of a page, so you're familiar with this step. Here's a screenshot of Firefox Inspector, highlighting the element we're interested in.

Screenshot - Firefox Inspector

We can see the hierarchy of elements that lead to the table: html, body, div, div, div.ticker, table.ticker_data. We can also see the source:

<table class="ticker_data">

Neat! It's labeled! Unfortunately, that class info gets dropped when we process the HTML in our script. Bummer. If it were id="ticker_data" instead, we could use the getElementByVal() utility from this answer to reach it, and give ourselves some immunity from future restructuring of the page. Put a pin in that - we'll come back to it.

It can help to visualize this in the debugger. Here's a utility script for that - run it in debug mode, and you'll have your HTML document laid out to explore:

/**
 * Debug-run this in the editor to be able to explore the structure of web pages.
 *
 * Set target to the page you're interested in.
 */
function pageExplorer() {
  var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
  var pageTxt = UrlFetchApp.fetch(target).getContentText();
  var pageDoc = Xml.parse(pageTxt,true);
  debugger;  // Pause in debugger - explore pageDoc
}

This is what our page looks like in the debugger:

Screenshot - debugger

You might be wondering what the numbered elements are, since you don't see them in the source. When there are multiples of an element type at the same level in an XML document, the parser presents them as an array, numbered 0..n. Thus, when we see 0 under a div in the debugger, that's telling us that there are multiple <div> tags in the HTML source at that level, and we can access them as an array, for example .div[0].
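To make the array behaviour concrete, here is a small plain-JavaScript model of it. It is only a sketch of the behaviour described above, not the parser's actual implementation; groupChildren and the { name: ... } node shape are made up for illustration.

```javascript
/**
 * Hypothetical model (NOT the Xml service itself): group sibling
 * elements by tag name. A name that occurs once maps to the single
 * node; a name that occurs several times maps to an array.
 */
function groupChildren(children) {
  var grouped = {};
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    var name = child.name;
    if (!Object.prototype.hasOwnProperty.call(grouped, name)) {
      grouped[name] = child;                   // first of its kind: plain node
    } else if (Array.isArray(grouped[name])) {
      grouped[name].push(child);               // already an array: append
    } else {
      grouped[name] = [grouped[name], child];  // second of its kind: promote to array
    }
  }
  return grouped;
}

// Made-up children of a <body>: two divs and one script.
var body = groupChildren([
  { name: "div", id: "header" },
  { name: "script" },
  { name: "div", id: "content" }
]);
```

In this model, body.div is an array, so the second div is body.div[1], while body.script is a single node and needs no index.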

Ok, theory behind us, let's go ahead and see how we can access the table by brute-force.

Knowing the hierarchy, including the div arrays shown in the debugger, we could do this, à la Phil's previous answer. I'll do some weird indenting to illustrate the document structure:

...
var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
var pageTxt = UrlFetchApp.fetch(target).getContentText();
var pageDoc = Xml.parse(pageTxt,true);
var table = pageDoc.getElement()
             .getElement("body")
               .getElements("div")[0]      // div at index 0 under body, shown in debugger
                 .getElements("div")[5]    // div at index 5 under that (the sixth one)
                   .getElement("div")      // another div
                     .getElement("table"); // finally, our table

As a much more compact alternative to all those .getElement() calls, we can navigate using dot notation.

var table = pageDoc.getElement().body.div[0].div[5].div.table;

And that's that.
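One caveat: a chain like this throws a TypeError as soon as one step is missing, which is likely to happen if the page layout changes. As a minimal sketch, assuming the parsed document can be treated as nested objects and arrays (which is how the dot notation treats it), a small helper can walk the path and return null instead. walkPath and the mock doc below are hypothetical, for illustration only.

```javascript
/**
 * Hypothetical helper: follow a list of property names / array
 * indexes from root, returning null if any step is missing
 * instead of throwing.
 */
function walkPath(root, path) {
  var node = root;
  for (var i = 0; i < path.length; i++) {
    if (node === null || node === undefined) return null;
    node = node[path[i]];
  }
  return node === undefined ? null : node;
}

// Mock document with the same shape as the path used above.
var doc = {
  body: {
    div: [
      { div: [{}, {}, {}, {}, {}, { div: { table: { rows: 3 } } }] },
      {}
    ]
  }
};

var table = walkPath(doc, ["body", "div", 0, "div", 5, "div", "table"]);
var missing = walkPath(doc, ["body", "div", 0, "div", 9, "div", "table"]);
```

Here table is the mock table node, while missing is null because there is no div at index 9.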

Let's go back to that pinned idea. In the debugger, we can see that there are various attributes attached to elements. In particular, there's an "id" on that div[5] that contains the div that contains the table. Remember, in the source we saw "class" attributes, but note that they don't make it this far.

Screenshot - debugger 2

Still, the fact that a kindly programmer put this "id" in place means we can do this, with getDivById() from that earlier question:

var contentDiv = getDivById( pageDoc.getElement().body, 'content' );
var table = contentDiv.div.table;

If they move things around, we might still be able to find that table, without changing our code.
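The idea behind an id lookup is a recursive search of the element tree. The sketch below shows that idea on plain objects. It is not the getDivById() from the earlier question; findDivById and the { name, attrs, children } node shape are assumptions made for illustration.

```javascript
/**
 * Hypothetical sketch: depth-first search for a <div> whose id
 * attribute matches. Returns the node, or null if none is found.
 */
function findDivById(node, id) {
  if (node.name === "div" && node.attrs && node.attrs.id === id) {
    return node;
  }
  var children = node.children || [];
  for (var i = 0; i < children.length; i++) {
    var found = findDivById(children[i], id);
    if (found) return found;
  }
  return null;
}

// Mock page: the 'content' div is nested one level deeper than expected.
var page = {
  name: "body", attrs: {}, children: [
    { name: "div", attrs: { id: "header" }, children: [] },
    { name: "div", attrs: {}, children: [
      { name: "div", attrs: { id: "content" }, children: [
        { name: "table", attrs: {}, children: [] }
      ] }
    ] }
  ]
};

var contentDiv = findDivById(page, "content");
```

Because the search does not depend on where the div sits in the tree, the lookup keeps working in this mock even though the 'content' div is wrapped in an extra div.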

You already know what to do once you have the table element, so we're done here!

Mogsdad
  • Wow, what an answer. I haven't even gone through it yet, but I'm already accepting it. I appreciate the help. Let me eat breakfast first and brew a nice pot of tea, then go through it. Thanks again! – jason Jun 01 '13 at 00:00
  • Edited to explain arrays of elements. – Mogsdad Jun 01 '13 at 02:27
  • Mogsdad, I'm using Phil's code to get the elements into an array to put into a Google spreadsheet. The first column doesn't seem to be showing up for some reason, i.e. it's blank. There seem to be 4 'td' elements in each 'tr', and I'm looping through them, but the first 'td' shows up blank. – jason Jun 01 '13 at 05:52
  • How did you navigate to the table directly? Is there a search function? I'm now working on a different page. I'm using the debugger, but there are a lot of branches. How do I figure out how to find the table? – jason Jun 03 '13 at 01:10
  • Start in the HTML source. Find your table. Work backwards to find out what elements contain the table. If there are multiple similar elements within the same container, they will become an array - count from 0 to know which one you need. Use the debugger to confirm your work. ... MUCH easier to use Firefox Inspector, or Chrome Inspector, because you can start by clicking on the visual element you want, and see `id` or other tags you can use to zap in. – Mogsdad Jun 03 '13 at 01:25
  • Got it. I'll try that. Why does the shortcut var contentDiv = getDivById( pageDoc.getElement().body, 'content' ); give me 'ReferenceError: "getDivById" is not defined'? – jason Jun 03 '13 at 01:32