0

I'm trying to extract a specific value from a web page so that I can pull it into a Google Sheets spreadsheet. The problem is that the page isn't structured in a way that makes the values easy to pull.

Given the HTML below, can anyone suggest a way to pull "$4,586" from the TD element after the one that contains "Prop Taxes"? There are lots of TDs on the page that have a class of "d97m50". There are also lots of tables with a class of "d97m2".

I tried the following but couldn't get either of them to work. For the first one, I couldn't determine a way to iterate through the TDs on the page, find the TD after the one containing "Prop Taxes" and extract just the text from it. The second one failed because I couldn't determine a regex that would do the same thing.

<TABLE class="d97m2" cellSpacing=0 cellPadding=0 sizset="false" sizcache06358115873960983="276 82 150">
<!-- A bunch of other rows -->
<TR>
<TD class="d97m40"><span class="label">Prop Taxes:</SPAN></TD>
<TD class="d97m50" colSpan=2><SPAN class="wrapped-field">$4,586</span></TD>
<TD class="d97m43"><span class="label d97m29">Garbage:</SPAN></TD>
<TD class="d97m26"><SPAN class="wrapped-field">$0</span></TD>
<TD class="d97m44"><span class="label">Parking Inc:</SPAN></TD>
<TD class="d97m45"><SPAN class="wrapped-field">$0</span></TD>
<TD class="d97m46"><span class="label">TOE:</SPAN></TD>
<TD class="d97m47"><SPAN class="wrapped-field">$10,248</span></TD></TR>
<TR>
<!-- a bunch more rows -->
</TABLE>
Community
  • 1
  • 1
Kris
  • 33
  • 3

2 Answers2

0

One fairly simple way to pull the table is to use the importhtml function within Sheets, for example:

=importhtml("http://www.tradingeconomics.com/zambia/rating","table",1)
Jan Krynauw
  • 1,042
  • 10
  • 21
  • Nice workaround Jann. After a little trial-and-error to find the right table number (there were 30 of them on the page) and a vlookup, I was able to make this work. I was hoping for more of an Apps Script based approach though - so that I could simply call a function and have it return the appropriate value. I'll mark this as the correct answer if nobody provides anything like that. – Kris Feb 02 '17 at 15:57
0

If you can get the HTML you are wishing to process into a Javascript String object, you could use a RegEx to identify the particular string you're after.

For example, given your test text:

<TABLE class="d97m2" cellSpacing=0 cellPadding=0 sizset="false"      sizcache06358115873960983="276 82 150">
<!-- A bunch of other rows -->
<TR>
<TD class="d97m40"><span class="label">Prop Taxes:</SPAN></TD>
<TD class="d97m50" colSpan=2><SPAN class="wrapped-field">$4,586</span></TD>
<TD class="d97m43"><span class="label d97m29">Garbage:</SPAN></TD>
<TD class="d97m26"><SPAN class="wrapped-field">$0</span></TD>
<TD class="d97m44"><span class="label">Parking Inc:</SPAN></TD>
<TD class="d97m45"><SPAN class="wrapped-field">$0</span></TD>
<TD class="d97m46"><span class="label">TOE:</SPAN></TD>
<TD class="d97m47"><SPAN class="wrapped-field">$10,248</span></TD></TR>
<TR>
<!-- a bunch more rows -->
</TABLE>

The following regular expression:

/.*?Prop\sTaxes(.|\s)*?d97m50.*?\$(.*?)<\/span/mg

will produce in its second match the value "4,586", which you can then process as you wish.

Here's an example answer that shows how to get multiple matches and process them.

Javascript Regular Expression multiple match

This code works for me:

function regExTest() {
  var s = '<TABLE class="d97m2" cellSpacing=0 cellPadding=0 sizset="false"      sizcache06358115873960983="276 82 150">' +
    '<!-- A bunch of other rows -->' +
    '<TR>' +
    '<TD class="d97m40"><span class="label">Prop Taxes:</SPAN></TD>' +
    '<TD class="d97m50" colSpan=2><SPAN class="wrapped-field">$1,986</span></TD>' +
    '<TD class="d97m43"><span class="label d97m29">Garbage:</SPAN></TD>' +
    '<TD class="d97m26"><SPAN class="wrapped-field">$0</span></TD>' +
    '<TD class="d97m44"><span class="label">Parking Inc:</SPAN></TD>' +
    '<TD class="d97m45"><SPAN class="wrapped-field">$0</span></TD>' +
    '<TD class="d97m46"><span class="label">TOE:</SPAN></TD>' +
    '<TD class="d97m47"><SPAN class="wrapped-field">$10,248</span></TD></TR>' +
    '<TR>' +
    '<TR>' +
    '<TD class="d97m40"><span class="label">Prop Taxes:</SPAN></TD>' +
    '<TD class="d97m50" colSpan=2><SPAN class="wrapped-field">$4,586</span></TD>' +
    '<TD class="d97m43"><span class="label d97m29">Garbage:</SPAN></TD>' +
    '<TD class="d97m26"><SPAN class="wrapped-field">$0</span></TD>' +
    '<TD class="d97m44"><span class="label">Parking Inc:</SPAN></TD>' +
    '<TD class="d97m45"><SPAN class="wrapped-field">$0</span></TD>' +
    '<TD class="d97m46"><span class="label">TOE:</SPAN></TD>' +
    '<TD class="d97m47"><SPAN class="wrapped-field">$10,248</span></TD></TR>' +
    '<TR>' +
    '<TR>' +
    '<TD class="d97m40"><span class="label">Prop Taxes:</SPAN></TD>' +
    '<TD class="d97m50" colSpan=2><SPAN class="wrapped-field">$2,514</span></TD>' +
    '<TD class="d97m43"><span class="label d97m29">Garbage:</SPAN></TD>' +
    '<TD class="d97m26"><SPAN class="wrapped-field">$0</span></TD>' +
    '<TD class="d97m44"><span class="label">Parking Inc:</SPAN></TD>' +
    '<TD class="d97m45"><SPAN class="wrapped-field">$0</span></TD>' +
    '<TD class="d97m46"><span class="label">TOE:</SPAN></TD>' +
    '<TD class="d97m47"><SPAN class="wrapped-field">$10,248</span></TD></TR>' +
    '<TR>' +
    '<TR>' +
    '<TD class="d97m40"><span class="label">Prop Taxes:</SPAN></TD>' +
    '<TD class="d97m50" colSpan=2><SPAN class="wrapped-field">$3,312</span></TD>' +
    '<TD class="d97m43"><span class="label d97m29">Garbage:</SPAN></TD>' +
    '<TD class="d97m26"><SPAN class="wrapped-field">$0</span></TD>' +
    '<TD class="d97m44"><span class="label">Parking Inc:</SPAN></TD>' +
    '<TD class="d97m45"><SPAN class="wrapped-field">$0</span></TD>' +
    '<TD class="d97m46"><span class="label">TOE:</SPAN></TD>' +
    '<TD class="d97m47"><SPAN class="wrapped-field">$10,248</span></TD></TR>' +
    '<TR>' +
    '<!-- a bunch more rows -->' +
    '</TABLE>';

  var qualityRegex = /.*?Prop\sTaxes(.|\s)*?d97m50.*?\$(.*?)<\/span/mg,
      matches = [];

  var match = qualityRegex.exec(s);
  while (match != null) {
      matches.push(match[2]);
      match = qualityRegex.exec(s);
  }

  /* Matches now contains the numbers you require */
}
Community
  • 1
  • 1
alfiethecoder
  • 343
  • 1
  • 2
  • 11
  • Tried the regex but it kept timing out. Tried using both an Apps Script function and [RegExr](http://regexr.com/). Both had the same result. – Kris Feb 02 '17 at 15:47
  • The code I tried was:`var page = UrlFetchApp.fetch(url).getContentText(); var propTax = page.match(/.*Prop\sTaxes(.|\s)*d97m50.*\$(.*)<\/span/mg)[2];` – Kris Feb 02 '17 at 16:05
  • How odd. It didn't indeed work. So in my changed code I've made all of the * quantifiers non-greedy, by adding a question mark after them and they now work. Try the function and see if it works for you. – alfiethecoder Feb 03 '17 at 15:27