2

I'm trying to get the elements from a web page in a Google spreadsheet using:

function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  var elements = XmlService.parse(html);                 
}

However, I keep getting the error:

Error on line 2: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character. (line 4, file "")

How do I solve this? I want to get the H1 text from this site, but for other sites I'll have to select other elements.

I know that XmlService.parse(html) works for other sites, like Wikipedia, as you can see here.

Rubén
user3347814

4 Answers

2

HTML isn't XML, so a strict XML parser will often reject it. For this case you don't need to parse the page at all; string methods are enough:

function pegarAsCoisas() {
  var urlFetchReturn = UrlFetchApp.fetch("http://www.saosilvestre.com.br");
  var html = urlFetchReturn.getContentText();

  Logger.log('html.length: ' + html.length);

  // Locate the opening <h1 tag, then the first closing </h1> after it
  var index_OfH1 = html.indexOf('<h1');
  var endingH1 = html.indexOf('</h1>', index_OfH1);

  Logger.log('index_OfH1: ' + index_OfH1);
  Logger.log('endingH1: ' + endingH1);

  // Take everything from "<h1 ..." up to "</h1>", then drop the opening tag itself
  var h1Content = html.slice(index_OfH1, endingH1);
  h1Content = h1Content.slice(h1Content.indexOf(">") + 1);

  Logger.log('h1Content: ' + h1Content);
}
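The same idea can be wrapped in a small reusable helper. This is only a sketch: `extractFirstTagText` is a hypothetical name, not an Apps Script API, and it assumes the tag name is written in the same case as in the page and that no attribute value contains a `>` character. It relies on the optional second argument of `indexOf` to resume the search from a given position.

```javascript
/**
 * Returns the trimmed text between the first <tagName ...> and the next
 * closing </tagName, or null when either one is missing.
 * Hypothetical helper; the match is case-sensitive.
 */
function extractFirstTagText(html, tagName) {
  var openToken = '<' + tagName;
  var from = 0;
  while (true) {
    var open = html.indexOf(openToken, from);
    if (open === -1) return null;
    // Make sure we matched the whole tag name ("<h1", not "<h10")
    var next = html.charAt(open + openToken.length);
    if (next === '>' || next === '/' || /\s/.test(next)) {
      var openEnd = html.indexOf('>', open);
      if (openEnd === -1) return null;
      // Search for the closing tag only after the opening tag ends
      var close = html.indexOf('</' + tagName, openEnd);
      if (close === -1) return null;
      return html.slice(openEnd + 1, close).trim();
    }
    // Not a real match; resume the search one character later
    from = open + 1;
  }
}

// In Apps Script this could be called as, for example:
// extractFirstTagText(UrlFetchApp.fetch(url).getContentText(), 'h1');
```

Nested tags inside the element are returned as-is, so the result may still contain markup.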
Alan Wells
  • I was using regex.exec(), but your method is way simpler. However, being able to parse the HTML would make my life much easier, since I would be able to select by ID, class, etc. in more complex pages. And you can parse HTML using XmlService.parse(html); some web pages, like Wikipedia, work just fine. – user3347814 Nov 26 '15 at 20:45
  • I think that to be able to use DOM methods, you'd need to pass the HTML back to the front end; which can be done. But again, I don't think you'll need to parse it. If you do know of a way to select ID, Class, etc in `.gs` server side code, let me know how to do it. – Alan Wells Nov 26 '15 at 23:13
  • Working March 2017! `indexOf` allows a [second parameter](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf) to adjust search starting point. More details on the [Class Logger from Google Apps Script](https://developers.google.com/apps-script/reference/base/logger). – joelhaus Mar 25 '17 at 21:33
1

XmlService works only with well-formed XML; it is not error tolerant. Google Apps Script used to have a more tolerant service called the Xml service, but it was deprecated. As of this answer it still works, and you can use that instead as explained here: GAS-XML

Sujay Phadke
1

Technically HTML and XHTML are not the same. See What are the main differences between XHTML and HTML?

Regarding the OP's code, the fetch itself works just fine:

function pegarAsCoisas() {
  var html =  UrlFetchApp
    .fetch('http://www.saosilvestre.com.br')
    .getContentText();
  Logger.log(html);
}

As previous answers said, XmlService should not be used directly on the content returned by UrlFetchApp. There are three alternatives: first convert the page source from HTML to XHTML so that the XML Service (XmlService) can parse it, use the older Xml service, which could work directly with HTML pages, or handle the page source directly as text.
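For the "handle it as text" option, selecting an element by its id can be approximated with a regular expression. The sketch below is an assumption-laden illustration, not a real DOM query: `extractTextById` is a hypothetical helper, the id is assumed to contain only letters, digits, hyphens, or underscores, and the element is assumed not to contain a nested element with the same tag name.

```javascript
/**
 * Returns the text content of the first element whose id attribute equals
 * `id`, with inner tags stripped, or null when no such element is found.
 * Hypothetical helper; it stops at the first closing tag of the same name.
 */
function extractTextById(html, id) {
  // Capture the tag name (group 1) and the quote character (group 2)
  var re = new RegExp(
    '<([A-Za-z][A-Za-z0-9]*)\\b[^>]*\\sid\\s*=\\s*(["\'])' + id + '\\2[^>]*>'
  );
  var match = re.exec(html);
  if (!match) return null;

  var start = match.index + match[0].length;
  var close = html.indexOf('</' + match[1], start);
  if (close === -1) return null;

  // Remove any inner tags and surrounding whitespace
  return html.slice(start, close).replace(/<[^>]*>/g, '').trim();
}
```

For pages with deeply nested or irregular markup this will not be reliable, and a real parser is the safer choice.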

Rubén
-1

Try replacing itemscope with itemscope = '':

function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  // Give the bare itemscope attribute an empty value so the XML parser accepts it
  html = html.replace("itemscope", "itemscope = ''");
  var elements = XmlService.parse(html);
}
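Note that `replace` with a string pattern changes only the first occurrence. If the page has several bare attributes of the same name, a regular expression can patch each tag. The following is a rough sketch, with `addEmptyAttributeValue` as a hypothetical helper name; it assumes the attribute name is a plain word and that the same word does not appear inside a quoted attribute value. Other HTML quirks (unclosed tags, entities such as `&nbsp;`) may still make XmlService.parse fail.

```javascript
/**
 * Adds ="" to a valueless attribute (for example itemscope) inside tags,
 * leaving attributes that already have a value untouched.
 * Hypothetical helper; patches at most one occurrence per tag.
 */
function addEmptyAttributeValue(html, attrName) {
  var re = new RegExp(
    '(<[^>]*\\s)' + attrName + '(?=[\\s/>])(?!\\s*=)',
    'g'
  );
  return html.replace(re, '$1' + attrName + '=""');
}

// In Apps Script this could be used before parsing, for example:
// var doc = XmlService.parse(addEmptyAttributeValue(html, 'itemscope'));
```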

For more information, look here.

Alex M