
I'm trying to parse a webpage in order to extract some URLs from it.

[...]
var html = UrlFetchApp.fetch('https://cse.google.com/cse?q=example&cx=006680642033474972217%3A6zo0hx_wle8#gsc.tab=0&gsc.q=example&gsc.page=1').getContentText();
var doc = XmlService.parse(html);
[...]

The URL in this code is an example and in the future the word "example", in both occurrences, may be something else.

When I run the code, XmlService.parse() fails and gives me the error in the title.

I am aware that the webpage has some markup messed up.

The problem is that I can't fix the markup once and solve the problem everywhere else, as I have to work with whatever UrlFetchApp.fetch() gives me.

I don't have to parse the whole document, so if the markup error is in a part of the document that I don't have to actually check, I can afford to simply not care about it.

Is there any way to automatically correct markup errors?

Or alternatively, is it possible to start parsing from somewhere other than the beginning (in particular from gsc-results gsc-webResult)?

Thank you for your attention.

EDIT:

Using Xml.parse(), the webpage parses successfully, but the result is this:

 <?xml version="1.0" encoding="UTF-8"?><body><noscript><h3>Google Custom Search requires JavaScript</h3><p>JavaScript is either disabled or not supported by your browser. To use Custom Search, enable JavaScript by changing your browser options and reloading this page.</p></noscript><div id="cse-hosted"><div id="cse-header"><a href="#" id="cse-logo-target" shape="rect"/><div id="cse-logo"><span class="lockup-logo"/> <span class="lockup-text"><span class="lockup-brand"> Custom Search</span></span></div><div id="cse-search-form">Loading</div></div><div id="cse-body"><div id="cse">Loading<div class="gsc-adBlock gsc-imageResult-classic gsc-imageResult-column gsc-clear-button gsc-branding hidden"/></div></div><div id="cse-footer">© 2017 Google</div></div></body>

Which is not the result I'm expecting. What can I do to solve this issue? Thanks in advance.


1 Answer


The error occurs because the content you are passing to XML Service isn't XHTML, so one way the question could be interpreted is:

How to convert HTML to XHTML by using Google Apps Script?

Google Apps Script doesn't include a built-in service that does this, so you could try the deprecated Xml service, which is "tolerant" of some markup errors.

Another alternative is to use JavaScript string handling techniques, such as regular expressions.
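
As a rough illustration of the regular-expression route, here is a minimal sketch that collects href values from anchor tags in an HTML string. The helper name extractHrefs is hypothetical (it is not part of Apps Script), it only handles quoted attribute values, and a regex like this will miss or misread unusual markup, so treat it as a starting point rather than a robust parser.

```javascript
/**
 * Hypothetical helper: returns the quoted href values of <a> tags found
 * in an HTML string. It does not parse the document, so malformed markup
 * elsewhere on the page does not matter.
 */
function extractHrefs(html) {
  // Match "<a", any attributes, then href="..." or href='...'.
  var pattern = /<a\b[^>]*?\shref\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  var urls = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    // Group 1 holds a double-quoted value, group 2 a single-quoted one.
    urls.push(match[1] !== undefined ? match[1] : match[2]);
  }
  return urls;
}

// In Apps Script the input would come from the fetch in the question, e.g.:
// var html = UrlFetchApp.fetch(url).getContentText();
// Logger.log(extractHrefs(html));
```

Note that this only sees what UrlFetchApp.fetch() actually returns; anything the page builds later with client-side JavaScript will not be in the string.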

For details, see "What is the best way to parse html in google apps script".

Rubén