How to solve error while parsing HTML

Question

I'm trying to get the elements from a web page in a Google spreadsheet using:

function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  var elements = XmlService.parse(html);                 
}

However, I keep getting the error:

Error on line 2: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character. (line 4, file "")

How do I solve this? I want to get the H1 text from this site, but for other sites I'll have to select other elements.

I know the method XmlService.parse(html) works for other sites, like Wikipedia. As you can see here.

score 2 · Answer 1 · answered Nov 25 '15 at 22:33

HTML isn't XML, so XmlService can't be relied on to parse it. For this task you don't need to parse it at all; string methods are enough:

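function pegarAsCoisas() {

  var urlFetchReturn = UrlFetchApp.fetch("http://www.saosilvestre.com.br");
  var html = urlFetchReturn.getContentText();

  Logger.log('html.length: ' + html.length);

  // Find the opening <h1 tag, then the first </h1> that follows it.
  var index_OfH1 = html.indexOf('<h1');
  var endingH1 = html.indexOf('</h1>', index_OfH1);

  Logger.log('index_OfH1: ' + index_OfH1);
  Logger.log('endingH1: ' + endingH1);

  // Stop if the page has no complete <h1> element.
  if (index_OfH1 === -1 || endingH1 === -1) {
    return;
  }

  // Slice out '<h1 ...>text', then drop everything up to the end of the opening tag.
  var h1Content = html.slice(index_OfH1, endingH1);
  h1Content = h1Content.slice(h1Content.indexOf(">") + 1);

  Logger.log('h1Content: ' + h1Content);

}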

Alan Wells

I was using regex.exec(), but your method is way simpler. However, being able to parse the HTML would make my life much easier, since I would be able to select by ID, class, etc. in more complex pages. And you can parse HTML using XmlService.parse(html); some web pages, like Wikipedia, work just fine. – user3347814 Nov 26 '15 at 20:45
I think that to be able to use DOM methods, you'd need to pass the HTML back to the front end; which can be done. But again, I don't think you'll need to parse it. If you do know of a way to select ID, Class, etc in `.gs` server side code, let me know how to do it. – Alan Wells Nov 26 '15 at 23:13
Working March 2017! `indexOf` allows a [second parameter](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf) to adjust search starting point. More details on the [Class Logger from Google Apps Script](https://developers.google.com/apps-script/reference/base/logger). – joelhaus Mar 25 '17 at 21:33
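
Following up on the comment about the second parameter of `indexOf`: a minimal sketch of how that parameter could be used to collect every `<h1>` on a page instead of only the first. The function name `extractAllTagText` is hypothetical (it is not part of Apps Script), and the approach assumes lowercase, non-nested tags. It is plain string searching, not HTML parsing.

```javascript
// Hypothetical helper: returns the trimmed inner text of every
// <tagName ...>...</tagName> pair found in html.
// Assumes tags are lowercase in the source and are not nested.
function extractAllTagText(html, tagName) {
  var open = '<' + tagName;
  var close = '</' + tagName + '>';
  var results = [];
  var from = 0;

  while (true) {
    // The second argument of indexOf resumes the search where we left off.
    var start = html.indexOf(open, from);
    if (start === -1) break;

    // Skip longer tag names that merely start with tagName (e.g. <header> for "h").
    var next = html.charAt(start + open.length);
    if (next !== '>' && !/\s/.test(next)) {
      from = start + open.length;
      continue;
    }

    // End of the opening tag, skipping any attributes.
    var openEnd = html.indexOf('>', start);
    if (openEnd === -1) break;

    // Look for the closing tag only after the opening tag.
    var end = html.indexOf(close, openEnd + 1);
    if (end === -1) break;

    results.push(html.slice(openEnd + 1, end).trim());
    from = end + close.length;
  }
  return results;
}
```

In Apps Script this could be called as `extractAllTagText(UrlFetchApp.fetch(url).getContentText(), 'h1')`, where `url` is whatever page is being fetched.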

score 1 · Answer 2 · edited May 23 '17 at 12:17

XmlService works only with 100% well-formed XML content; it is not error tolerant. Google Apps Script used to have a tolerant service called Xml, but it was deprecated. However, it still works and you can use that instead, as explained here: GAS-XML

answered Sep 20 '16 at 16:21

Sujay Phadke

score 1 · Answer 3 · edited May 23 '17 at 12:17

Technically HTML and XHTML are not the same. See What are the main differences between XHTML and HTML?

Regarding the OP's code, the following works just fine:

function pegarAsCoisas() {
  var html =  UrlFetchApp
    .fetch('http://www.saosilvestre.com.br')
    .getContentText();
  Logger.log(html);
}

As previous answers said, other methods should be used instead of calling XmlService directly on the content returned by UrlFetchApp. You could first convert the web page source from HTML to XHTML so that the XML Service (XmlService) can parse it, use the deprecated Xml service, which could work directly with HTML pages, or handle the web page source directly as text.
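
For the last option, handling the source as text, a regular expression can approximate selecting an element by `id`. This is a minimal sketch, not a real selector engine: `getInnerHtmlById` is a hypothetical name, and it assumes the `id` value is quoted and that the element does not contain a nested element with the same tag name.

```javascript
// Hypothetical helper: returns the inner HTML of the first element whose
// id attribute equals the given id, or null if no such element is found.
// Assumes a quoted id value and no nested element of the same tag name.
function getInnerHtmlById(html, id) {
  // Escape characters that have a special meaning in a regular expression.
  var safeId = id.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

  // Group 1: tag name. Group 2: everything up to the first matching close tag.
  // Requiring whitespace before "id" avoids matching attributes like data-id.
  var pattern = new RegExp(
    '<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*\\sid\\s*=\\s*["\']' + safeId +
    '["\'][^>]*>([\\s\\S]*?)</\\1\\s*>',
    'i'
  );

  var match = pattern.exec(html);
  return match ? match[2] : null;
}
```

In Apps Script the input would come from `UrlFetchApp.fetch(...).getContentText()`, as in the code above. For pages with deeply nested markup this kind of pattern becomes fragile, which is why the other two options may be preferable there.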
