
I want to create a scraper using Google Spreadsheets with Google Apps Script. I know it is possible and I have seen some tutorials and threads about it.

The main idea is to use:

  var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText();
  var doc = XmlService.parse(html);

And then get and work with the elements. However, the method

XmlService.parse()

does not work for some pages. For example, if I try:

function test(){
    var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
    var parse = XmlService.parse(html);
}

I get the following error:

Error on line 225: The entity name must immediately follow the '&' in the entity reference. (line 3, file "")

I've tried to use string.replace() to eliminate the characters that are apparently causing the error, but it does not work; all sorts of other errors appear. The following code, for example:

function test(){
    var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
    var regExp = new RegExp("&", "gi");
    html = html.replace(regExp,"");

    var parse = XmlService.parse(html);
}

Gives me the following error:

Error on line 358: The content of elements must consist of well-formed character data or markup. (line 6, file "")

I believe this is a problem with the XmlService.parse() method.
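For reference, a more surgical replacement than deleting every & is to escape only the bare ampersands, i.e. those not already part of an entity like &amp; or &#39;. This is a sketch of that idea (the helper name is mine); note that this alone is usually not enough, since HTML also contains unclosed tags that XML parsers reject:

```javascript
// Escape only "bare" ampersands (not already part of an entity such as
// &amp; or &#39;) instead of stripping them all from the page.
function escapeBareAmps(html) {
  // "&" not followed by an entity name/number and ";" becomes "&amp;"
  return html.replace(/&(?!#?\w+;)/g, '&amp;');
}
```

For example, `escapeBareAmps('Fish & Chips &amp; coffee')` leaves the existing entity alone and only escapes the first ampersand.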

I've read in these threads: Google App Script parse table from messed html and What is the best way to parse html in google apps script, that one can use a deprecated method called Xml.parse(), which accepts a second parameter that allows parsing HTML. However, as I've mentioned, it is deprecated and I cannot find any documentation on it anywhere. Xml.parse() seems to parse the string, but I have trouble working with the elements due to the lack of documentation. It is also not a safe long-term solution, because it could be deactivated at any time.

So, I want to know: how do I parse this HTML in Google Apps Script?

I also tried:

function test(){

    var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
    var htmlOutput = HtmlService.createHtmlOutput(html).getContent();

    var parse = XmlService.parse(htmlOutput);
}

But it does not work either; I get this error:

Malformed HTML content:

I thought about using an open-source library to parse the HTML, but I could not find any.

My ultimate goal is to get some information from a set of pages, like the price, link, and name of the products. I've managed to do this using a series of regular expressions:

var ss = SpreadsheetApp.getActiveSpreadsheet();
var linksSheet = ss.getSheetByName("Links");
var resultadosSheet = ss.getSheetByName("Resultados");

function scrapyLoco(){

  var links = linksSheet.getRange(1, 1, linksSheet.getLastRow(), 1).getValues();
  var arrayGrandao = [];
  for (var row = 0, len = links.length; row < len; row++){
    var link = links[row];
    var arrayDeResultados = pegarAsCoisas(link[0]);
    Logger.log(arrayDeResultados);
    arrayGrandao.push(arrayDeResultados);
  }

  resultadosSheet.getRange(2, 1, arrayGrandao.length, arrayGrandao[0].length).setValues(arrayGrandao);
}


function pegarAsCoisas(linkDoProduto) {
  var resultadoArray = [];

  var html = UrlFetchApp.fetch(linkDoProduto).getContentText();
  var regExp = new RegExp("<h1([^]*)h1>", "gi");
  var h1Html = regExp.exec(html);
  var h1Parse = XmlService.parse(h1Html[0]);
  var h1Output = h1Parse.getRootElement().getText();
  h1Output = h1Output.replace(/(\r\n|\n|\r|(^( )*))/gm,"");

  regExp = new RegExp("Ref.: ([^(])*", "gi");
  var codeHtml = regExp.exec(html);
  var codeOutput = codeHtml[0].replace("Ref.: ","").replace(" ","");

  regExp = new RegExp("margin-top: 5px; margin-bottom: 5px; padding: 5px; background-color: #699D15; color: #fff; text-align: center;([^]*)/div>", "gi");
  var descriptionHtml = regExp.exec(html);
  var regExp = new RegExp("<p([^]*)p>", "gi");
  var descriptionHtml = regExp.exec(descriptionHtml);
  var regExp = new RegExp("^[^.]*", "gi");
  var descriptionHtml = regExp.exec(descriptionHtml);
  var descriptionOutput = descriptionHtml[0].replace("<p>","");
  descriptionOutput = descriptionOutput+".";

  regExp = new RegExp("ecom(.+?)Main.png", "gi");
  var imageHtml = regExp.exec(html);
  var comecoDaURL = "https://www.nespresso.com/";
  var imageOutput = comecoDaURL+imageHtml[0];

  var regExp = new RegExp("nes_l-float nes_big-price nes_big-price-with-out([^]*)p>", "gi");
  var precoHtml = regExp.exec(html);
  var regExp = new RegExp("[0-9]*,", "gi");
  precoHtml = regExp.exec(precoHtml);
  var precoOutput = "BRL "+precoHtml[0].replace(",","");

  resultadoArray = [codeOutput,h1Output,descriptionOutput,"Home & Garden > Kitchen & Dining > Kitchen Appliances > Coffee Makers & Espresso Machines",
                    "Máquina",linkDoProduto,imageOutput,"new","in stock",precoOutput,"","","","Nespresso",codeOutput];

  return resultadoArray;
}

But this is very time-consuming to program, hard to change dynamically, and not very reliable.

I need a way to parse this HTML and easily access its elements. It's actually not an add-on, but a simple Google Apps Script.

user3347814
    Does this answer your question? [Parse HTML string using CSS selectors](https://stackoverflow.com/questions/11348946/parse-html-string-using-css-selectors) – Kos Sep 14 '21 at 18:24

8 Answers


I made cheeriogs for this problem. It works on GAS like cheerio, which has a jQuery-like API. You can do it like this:

const content = UrlFetchApp.fetch('https://example.co/').getContentText();
const $ = Cheerio.load(content);
Logger.log($('p .blah').first().text()); // blah blah blah ...

See also https://github.com/asciian/cheeriogs

asciian

This has been discussed before - see this Q&A.

Unlike the old Xml service, XmlService is not very forgiving of malformed HTML. The trick in the answer by Justin Bicknell does the job. Even though the Xml service has been deprecated, it still continues to work.
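For reference, the trick boils down to the deprecated Xml service's lenient mode. A minimal sketch, with the caveat that this runs only inside Google Apps Script and that the sunset Xml service may break at any time:

```javascript
// Minimal sketch of the deprecated lenient parse: the second argument
// (true) tells the old Xml service to tolerate malformed HTML.
// Only runs inside Google Apps Script; Xml is sunset and may break.
function parseLeniently(url) {
  var html = UrlFetchApp.fetch(url).getContentText();
  var doc = Xml.parse(html, true);   // true => lenient (HTML) parsing
  return doc.getElement();           // root element of the parsed tree
}
```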

Sujay Phadke

I have done this in vanilla JS. It's not real HTML parsing, just an attempt to get some content out of a string (the fetched URL):

function getLKKBTC() {
  var url = 'https://www.lykke.com/exchange';
  var html = UrlFetchApp.fetch(url).getContentText();
  var searchstring = '<td class="ask_BTCLKK">';
  var index = html.search(searchstring);
  if (index >= 0) {
    var pos = index + searchstring.length;
    var rate = html.substring(pos, pos + 6);
    rate = parseFloat(rate);
    rate = 1 / rate;
    return rate;
  }
  throw "Failed to fetch/parse data from " + url;
}
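The same idea can be packaged as a small reusable helper (the names here are my own): return whatever sits between two literal markers, or null when either marker is missing.

```javascript
// Pull out the text between two literal markers in a raw HTML string.
// Returns null if either marker is not found.
function extractBetween(html, before, after) {
  var start = html.indexOf(before);
  if (start < 0) return null;
  start += before.length;
  var end = html.indexOf(after, start);
  return end < 0 ? null : html.substring(start, end);
}
```

For example, `extractBetween(html, '<td class="ask_BTCLKK">', '</td>')` would pull out the rate text in one call.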
Martin Krung

Please be aware that certain websites may not permit automated scraping of their content, so please consult their terms of service before using Apps Script to extract the content.

The XmlService only works against valid XML documents, and most HTML (especially HTML5) is not valid XML. A previous version of the XmlService, simply called Xml, allowed for "lenient" parsing, which would allow it to parse HTML as well. This service was sunset in 2013, but for the time being it still functions. The reference docs are no longer available, but this old tutorial shows its usage.

Another alternative is to use a service like Kimono, which handles the scraping and parsing parts and provides a simple API you can call via UrlFetchApp to retrieve the structured data.

Eric Koleda
    Kimono was acquired and shut down in February, 2016. [Portia](https://blog.scrapinghub.com/2016/02/17/portia-alternative-to-kimono/) is an open-source alternative. – Chris May 30 '18 at 22:22

I had some good luck today just by massaging the HTML:

// close unclosed tags
html = html.replace(/(<(?=link|meta|br|input)[^>]*)(?<!\/)>/ig, '$1/>')
// force script / style content into cdata
html = html.replace(/(<(script|style)[^>]*>)/ig, '$1<![CDATA[').replace(/(<\/(script|style)[^>]*>)/ig, ']]>$1')
// change & to &amp;
html = html.replace(/&(?!amp;)/g, '&amp;')
// now it works! (tested with original url)
let document = XmlService.parse(html)
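As the comments below note, the first replacement can be extended to cover other void tags such as img. A variant of that line as a helper (still a best-effort massage, not a real parser):

```javascript
// Variant of the "close unclosed tags" step that also covers img, hr
// and area; tags that are already self-closed are left alone.
function selfCloseVoidTags(html) {
  return html.replace(/(<(?=link|meta|br|input|img|hr|area)[^>]*)(?<!\/)>/ig, '$1/>');
}
```

For example, `selfCloseVoidTags('<img src="a.png">')` yields `<img src="a.png"/>`, while an already-closed `<br/>` passes through unchanged.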
pguardiario
    works well, I added img tags to this – MC9000 Feb 19 '22 at 21:47
  • It doesn't work for me. I tried to add the img to the first line, but not sure I did it right. Can we get this code updated so it works correctly like, most of the time? – Jason Cramer Mar 25 '22 at 20:19

I've found a very neat alternative for scraping with Google Apps Script. It is called PhantomJS Cloud. One can use UrlFetchApp to access the API. This allows you to execute jQuery code on the pages, which makes life so much simpler.
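A hedged sketch of calling PhantomJS Cloud from Apps Script. The endpoint shape and payload fields here are assumptions based on the service's docs, and YOUR_API_KEY is a placeholder; check the current API before relying on this:

```javascript
// Hedged sketch: ask PhantomJS Cloud to render a page and return its
// HTML. Endpoint path and payload fields are assumptions; replace
// YOUR_API_KEY with a real key. Runs only inside Google Apps Script.
function fetchRenderedHtml(pageUrl) {
  var endpoint = 'https://phantomjscloud.com/api/browser/v2/YOUR_API_KEY/';
  var payload = { url: pageUrl, renderType: 'html' };
  var response = UrlFetchApp.fetch(endpoint, {
    method: 'post',
    contentType: 'application/json',
    payload: JSON.stringify(payload)
  });
  return response.getContentText();
}
```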

user3347814

Could you use JavaScript to parse the HTML? If your Google Apps Script retrieved the HTML as a string and then returned it to a JavaScript function, it seems like you could parse it just fine outside of the Google Apps Script. Any tags you want to scrape, you could send to a dedicated Apps Script function that would save the content.

You could probably accomplish this more easily with jQuery.

Eric Dauenhauer
I need to run it in Google Apps Script. I didn't find a way to run jQuery in Apps Script. – user3347814 Nov 26 '15 at 15:53
  • I didn't read the OP closely enough - I wrote my answer imagining that you were making a web app and could pass data easily between a javascript file and your google app script file. I don't know how to accomplish that result in a Spreadsheet add on. – Eric Dauenhauer Nov 28 '15 at 02:31

Maybe not the cleanest approach, but simple string processing does the job too, without XmlService:

var url = 'https://somewebsite.com/?q=00:11:22:33:44:55';
var html = UrlFetchApp.fetch(url).getContentText();
// we want only the link text displayed from here:
//<td><a href="/company/ubiquiti-networks-inc">Ubiquiti Networks Inc.</a></td>
var string1 = html.split('<td><a href="/company/')[1]; // all after '<td><a href="/company/'
var string2 = string1.split('</a></td>')[0];           // all before '</a></td>'
var string3 = string2.split('>')[1];                   // all after '>'
Logger.log('link text: '+string3);                     // string3 => "Ubiquiti Networks Inc."
Christoph Lösch