34
var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);

The above code gives a parse error when used. However, if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the HTML properly.

var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);

The problem is mostly caused by the lack of CDATA sections around the JavaScript part of the HTML, and the parser complains with the following error:

The entity name must immediately follow the '&' in the entity reference.

Even if I remove all the <script>(.*?)</script> blocks using regex, it still complains because the <br> tags aren't closed. Is there a clean way of parsing HTML into a DOM tree?
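A minimal sketch of that pre-cleaning idea (the helper name `sanitizeHtml_` and the exact patterns are illustrative, not from any library). It strips `<script>` blocks, self-closes `<br>` tags, and escapes bare ampersands before the string reaches an XML parser:

```javascript
// Illustrative pre-cleaning helper (plain string/regex transforms, no GAS APIs).
// It will not fix every malformed page, but it addresses the two errors above:
// bare '&' characters and unclosed <br> tags.
function sanitizeHtml_(html) {
  return html
    // Drop <script>...</script> blocks entirely ([\s\S] also matches newlines).
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    // Self-close <br> and <br ...> so the XML parser sees <br/>.
    .replace(/<br([^>]*?)\/?>/gi, '<br$1/>')
    // Escape ampersands that do not already start an entity reference.
    .replace(/&(?![a-zA-Z]+;|#\d+;|#x[0-9a-fA-F]+;)/g, '&amp;');
}
```

The cleaned string could then be passed to `XmlService.parse`; pages with other malformations (unquoted attributes, other void tags) would still need more work.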

copperhead
  • See http://stackoverflow.com/a/1732454/362634 ... – Marius Schulz Oct 18 '13 at 17:20
  • You could possibly fetch the page, use a parser to strip out the unwanted tags, and then initiate another parse. I do know that XmlService works well but is kind of buggy. – hwnd Oct 18 '13 at 18:07

11 Answers

34

I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, and then passing its XML string into the new XmlService.parse method:

var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();

Note: This solution may not work if the old Xml.parse is completely removed from Google Scripts.

Justin Bicknell
  • 1
    What about completely ill structured 'html' documents that do not validate and XmlService.parse just chokes on them? – imrek Apr 08 '16 at 03:41
  • 2
    doc.html.body is an array for me for some reason, and each element seems to be different – Mingwei Samuel Apr 18 '16 at 07:24
  • I think the javascript is causing elements to be ended early – Mingwei Samuel Apr 18 '16 at 07:31
  • 4
    It's great because it allows you to use the well-documented XmlService, which would otherwise be unusable, as it has trouble parsing HTML files. In case anyone is wondering how to select elements, this will help: https://sites.google.com/site/scriptsexamples/learn-by-example/parsing-html (Sorry for the new comment, it was too late to edit..) – Liran H Oct 09 '17 at 18:59
  • I was getting an error saying .toXmlString() could not be found. If you get this you need to make it: var bodyHtml = doc.html.body[0].toXmlString(); It is because of multiple body tags in the document which is a rare thing but for some reason I have them in mine. – BrinkDaDrink Mar 17 '18 at 15:18
  • 11
    Note from 2020: Xml.parse has indeed been removed. – J. G. Oct 04 '20 at 18:06
26

In 2021, the best way to parse HTML on the .gs side that I know of is...

  1. Click + next to Library
  2. Enter 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
  3. Click "Look up"
  4. Click Add
  5. Sample usage:
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);

$('.some-class').first().text();

That's it -- this is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() is important or else you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).

Credit where credit is due: https://github.com/tani/cheeriogs

thdoan
  • 1
    This is the best answer I found in all my search today. I coded it and it works perfectly fine. Combining this information with a blog on Medium (https://medium.com/@stefanhyltoft/scraping-html-tables-with-nodejs-request-and-cheerio-e3c6334f661b) and some Cheerio documentation, I was able to make it work for very complex HTML table parsing of COVID hospital data. – Farooq Khan May 10 '21 at 21:10
13

As of May 2020, you can now use the Cheerio library for Google Apps Script to do this.

Returns the content of Wikipedia's Main Page

const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('#mp-right').text());

Returns the content of the first paragraph <p> of Wikipedia's Main Page

const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('p').first().text());

To add to your project:

Select Resources - Libraries... in the Google Apps Script editor. Enter the project key 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0 in the Add a library field, and click "Add". Select the highest version number, and click "Save".

Kos
Let Me Tink About It
5

I found that the best way to parse HTML in Google Apps Script is to avoid XmlService.parse and Xml.parse entirely: XmlService.parse doesn't work well with bad HTML code from certain websites.

Here is a basic example of how you can parse any website easily without XmlService.parse or Xml.parse. In this example, I retrieve a list of presidents from "wikipedia.org/wiki/President_of_the_United_States" with regular JavaScript document.getElementsByTagName() and paste the values into my Google spreadsheet.

1- Create a new Google Sheet;

2- Click the menu Tools > Script editor... to open a new tab with the code editor window and copy the following code into your Code.gs:

function onOpen() {
  var ui = SpreadsheetApp.getUi();
  ui.createMenu("Parse Menu")
    .addItem("Parse", "parserMenuItem")
    .addToUi();
}

function parserMenuItem() {
  var sideBar = HtmlService.createHtmlOutputFromFile("test");
  SpreadsheetApp.getUi().showSidebar(sideBar);
}

function getUrlData(url) {
  var doc = UrlFetchApp.fetch(url).getContentText();
  return doc;
}

function writeToSpreadSheet(data) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = 1;

  for (var i = 0; i < data.length; i++) {
    var x = data[i];
    var range = sheet.getRange(row, 1);
    range.setValue(x);
    row = row + 1;
  }
}

3- Add an HTML file to your Apps Script project. Open the Script Editor, choose File > New > Html File, and name it 'test'. Then copy the following code into your test.html:

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<input id="mButon" type="button" value="Click here to get list"
onclick="parse()">
<div hidden id="mOutput"></div>
<script>

window.onload = onOpen;

function onOpen() {
  var url = "https://en.wikipedia.org/wiki/President_of_the_United_States";
  google.script.run.withSuccessHandler(writeHtmlOutput).getUrlData(url);
  document.getElementById("mButon").style.visibility = "visible";
}

function writeHtmlOutput(x) {
  document.getElementById('mOutput').innerHTML = x;
}

function parse() {
  var list = document.getElementsByTagName("area");
  var data = [];

  for (var i = 0; i < list.length; i++) {
    var x = list[i];
    data.push(x.getAttribute("title"));
  }

  google.script.run.writeToSpreadSheet(data);
}
</script>
</body>
</html>

4- Save your .gs and .html files and go back to your spreadsheet. Reload the spreadsheet, click "Parse Menu" > "Parse", and then click "Click here to get list" in the sidebar.

Yves R
  • It seems that there is some risk of this allowing whatever content you read in over the wire to run scripts when it's added under the mOutput div. If you are loading HTML from a third-party server, I would recommend sticking the output in a sandboxed iframe with scripting disabled. – Ben Kelly Dec 28 '20 at 21:02
  • Or I guess even better would be to use DOMParser on the string to create a Document without ever adding it to the active DOM. https://developer.mozilla.org/en-US/docs/Web/API/DOMParser – Ben Kelly Dec 29 '20 at 00:04
4

Xml.parse() has an option to turn on lenient parsing, which helps when parsing HTML. Note that the Xml service is deprecated however, and the newer XmlService doesn't have this functionality.

Eric Koleda
4

For simple tasks such as grabbing one value from a webpage, you could use a regular expression. Regex is notoriously bad for parsing HTML, as there are all sorts of weird cases that can trip it up, but if you're confident about the HTML you're accessing, this can sometimes be the simplest way.

Here's an example that fetches the contents of the page's <title> tag:

var page = UrlFetchApp.fetch(contestURL);
var regExp = new RegExp("<title>(.*)</title>", "gi");
var result = regExp.exec(page.getContentText());
// [1] is the match group when using parenthesis in the pattern
var value = result ? result[1] : 'No title found';
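If you need every occurrence rather than just the first, you can loop over exec() with a global, non-greedy pattern. A small illustrative sketch (plain JavaScript, no Apps Script services, so it runs anywhere; the helper name `extractHrefs_` is made up):

```javascript
// Collect all href values from an HTML string with a global, non-greedy regex.
// Fine for quick scrapes of pages whose markup you trust; not a general HTML parser.
function extractHrefs_(html) {
  var regExp = /href="(.*?)"/gi; // non-greedy so each match stops at its own closing quote
  var match;
  var hrefs = [];
  while ((match = regExp.exec(html)) !== null) {
    hrefs.push(match[1]); // group 1 is the quoted value
  }
  return hrefs;
}
```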
Simon East
4

I know it is not exactly what OP asked, but I found this question when I was looking for some html parsing options - so it might be useful for others as well.

There is an easy-to-use library for TEXT parsing. It's useful if you want to get only one piece of information from the HTML (XML) code.

EDIT 2021: The script library id is:
1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw

(Visualization of the text-parsing principle omitted.) The library extracts the text found between the given fromText and toText markers:

function getData() {
    var url = "https://chrome.google.com/webstore/detail/signaturesatori-central-s/fejomcfhljndadjlojamaklegghjnjfn?hl=en";
    var fromText = '<span class="e-f-ih" title="';
    var toText = '">';
  
    var content = UrlFetchApp.fetch(url).getContentText();
    var scraped = Parser
                    .data(content)
                    .from(fromText)
                    .to(toText)
                    .build();
    Logger.log(scraped);
    return scraped;
}
3

If you are using the Cheerio library for Google Apps Script:

Source code

Library page (⭐ star it!)

Installation by library ID:

1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0

A function to get current emojis from unicode.org:

function getEmojis() {
  var t = new Date();
  var url = 'https://unicode.org/emoji/charts/full-emoji-list.html';
  var fetch = UrlFetchApp.fetch(url);
  var contentText = fetch.getContentText();
  //console.log(new Date() - t);
  
  // Cherio
  var $ = Cheerio.load(contentText);
  var data = [];
  $("table > tbody > tr").each((index, element) => {
    var row = [];
    $(element).find("td").each((index, child) => {
      row.push($(child).text());
    });
    if (row.length > 0) {
      data.push(row);
    }    
  });

  //console.log(data);
  //console.log(new Date() - t);

  // Result
  return data;

}

↑ The sample code shows how to parse a table and put it into a 2D array.

It may also be used as a custom function in a spreadsheet cell.

Bonus

Parsing a site may be a time-consuming operation, and you may reach the URL Fetch quota. Here's a test file with a full version of the script:

https://docs.google.com/spreadsheets/d/1iO7YjYWyfseQu_YCfRbGDPg7NskOgMu_iO1iGjr7KxY/edit#gid=93365395

↑ It uses CacheService to reduce the number of calls.

Max Makhrov
1

There are two options:

a) One is to use JavaScript's string functions. First locate your tag using string.indexOf() and then extract the data you want using string.substring().

b) The other option is to make use of the Xml Service.
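A runnable sketch of option (a), using only plain string functions (the helper name `getTagText_` is illustrative); note it only handles simple tags without attributes and takes the first occurrence:

```javascript
// Extract the text between the first <tag> and </tag> using indexOf/substring.
// Returns null if either tag is missing.
function getTagText_(html, tag) {
  var openTag = '<' + tag + '>';
  var closeTag = '</' + tag + '>';
  var start = html.indexOf(openTag);
  if (start === -1) return null;
  start += openTag.length; // skip past the opening tag itself
  var end = html.indexOf(closeTag, start);
  if (end === -1) return null;
  return html.substring(start, end);
}
```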

Let Me Tink About It
Srik
  • 3
    Option a) is not convenient enough. Option b) would be good if it had XPath selection. I can't find such a function. – rik Jul 05 '12 at 17:14
1

Natively there's no way, unless you do what you already tried, which won't work if the HTML doesn't conform to the XML format.

Zig Mandel
-1

It's not possible to create an HTML DOM server-side in Apps Script. Using regular expressions is likely your best option, at least for simple parsing.

Eric Koleda