
I used the Xml service before.

But I got an error message saying "xml is deprecated."

So I understand that Xml cannot be used in the future and that XmlService should be used instead.

Here is my code before.

The solution comes from here (by Mr. Justin Bicknell).

function xml_parsing(senderId) {
  var url = "https://home.gamer.com.tw/homeindex.php?owner=" + senderId;
  var fetch = UrlFetchApp.fetch(url);
  // Deprecated Xml service; the second argument enables lenient parsing.
  var doc = Xml.parse(fetch, true);
  var bodyHtml = doc.html.body.toXmlString();
  // XmlService only accepts well-formed XML.
  var xml = UrlFetchApp.fetch(url).getContentText();
  var doc_parse = XmlService.parse(xml);
  var root = doc_parse.getRootElement();

}

Then I removed the Xml call to fix it.

function xml_parsing(senderId) {

  var url = "https://home.gamer.com.tw/homeindex.php?owner=" + senderId;
  var fetch = UrlFetchApp.fetch(url).getContentText();
  var doc_parse = XmlService.parse(fetch);
  var root = doc_parse.getRootElement();

}

An error about entities then occurred:

The entity name must immediately follow the '&' in the entity reference

So I tried to fix the URL by converting it to entity form.

var url = "https://home.gamer.com.tw/homeindex.php?owner="+ senderId

The error still occurred.

I googled other documents.

One said that XmlService.parse is too strict for HTML,

because HTML follows a less strict standard.

(For example, in HTML a tag can stand alone without a closing tag,

but in XML every tag has to be closed.)

So I want to ask how to use XmlService.parse in this situation.

Thanks!

Hsinhsin Hung
  • Can you provide the script for replicating your issue as the text value instead of the image? It will help users think of your issue and the solution. – Tanaike Jul 24 '19 at 00:18
  • @Tanaike Thanks for your advice. I have replaced the picture with the text value. – Hsinhsin Hung Jul 24 '19 at 00:24
  • Thank you for updating it. Although I'm not sure about the actual data, in your case, I think that the HTML can be parsed with XmlService by retrieving the required range from HTML data. As a sample, you can see it at [here](https://stackoverflow.com/q/53024255/7108653). – Tanaike Jul 24 '19 at 00:26
  • Nice to meet you, @Tanaike. The problem is that `XmlService.parse()` doesn't have the same leniency as `Xml.parse()` did, and any unclosed tags passed to it will raise an error. The page has unclosed tags as well as an opening tag that is never closed. On top of this, `XmlService.parse()` expects an entity reference after seeing a `&` character, which is causing problems as there is JavaScript embedded in the document. Could you shed any light on how to mitigate this? `dl = l != 'dataLayer' ? '&l=' + l : '';` is the problem line at the moment. – Rafa Guillermo Jul 24 '19 at 13:16
  • Thank you for replying. In my sample script, as a workaround, XmlService is used by retrieving the required parts from the HTML data in order to avoid the issue in your replying. For example, can you provide a sample HTML data? By this, it might be able to clarify about this workaround. – Tanaike Jul 24 '19 at 22:42
  • @Tanaike Thank you for getting back to me. I have run your example code and while it works well, there is another issue because the page at `https://home.gamer.com.tw/homeindex.php?owner=username` contains multiple tags. I have been trying to format the code which you can have a look at [here](https://script.google.com/a/egs-sbt014.eu/d/1jj_8jTz-asNm1i4pSPZfjoUZ06x8ZWT0jgdwkUmgHCUVoYj1jG_XKgg_/edit) but honestly I'm not so sure how to get the line to parse without manually editing the line to escape the `&` which isn't suitable in case the webpage layout changes. – Rafa Guillermo Jul 25 '19 at 08:50
  • Thank you for replying and sharing the current script. Can I ask you about the values you want? Most HTML data cannot be directly parsed by XmlService. So in my sample script, I use the following flow. 1. Retrieve HTML data. 2. Confirm the required values. 3. When there are several required values, retrieve the range including the required values using the regex and Parser (GAS library). 4. Parse the retrieved values by XmlService and retrieve the required values. If this was not the direction you want, I apologize. – Tanaike Jul 25 '19 at 22:53
  • @Tanaike Your script is powerful but not helpful for this problem. The problem here is the deprecation of the leniency option for parsing malformed HTML data when moving from Xml to XmlService. I ran your code and it does not handle an opening tag if there is no corresponding closing tag. JavaScript is also a problem, as it uses XML special characters as operators and they are not escaped before being passed to XmlService.parse. – Rafa Guillermo Jul 27 '19 at 10:07
  • @Tanaike [I am writing a JavaScript-XML parser to try and fix this at the moment](https://github.com/rafa-guillermo/js-xml-parser) but I'm still working on edge cases. Once the embedded JavaScript data has had relevant characters escaped ([like you can see here](https://stackoverflow.com/questions/1091945/)) and the unclosed tags in the HTML document are fixed, then the document can finally be passed to XmlService. I've got a basic script to fix the unclosed tags in the page @HsinhsinHung wants to get, which I posted in my previous comment, but it needs to be more generic too. – Rafa Guillermo Jul 27 '19 at 10:11
  • @Rafa Guillermo @Tanaike Thanks for your replies these days. Thank you for your cooperation. Has this case been solved? I read your Google script link; the `dl = l != 'dataLayer' ? '&l=' + l : '';` syntax error exists there, too. – Hsinhsin Hung Jul 27 '19 at 11:43
  • @HsinhsinHung Hello, I am still working on your question. Unclosed tags and embedded JavaScript that uses XML special characters are what's causing the problem. There is an issue about this on [Google's Issue Tracker](https://issuetracker.google.com/issues/117641636); starring it in the top left will let Google know you're having the same issue. The Apps Script documentation [only specifies XML](https://developers.google.com/apps-script/reference/xml-service/xml-service#parsexml) can be parsed using `XmlService.parse(xml)` and so actually this function appears to be working as intended. – Rafa Guillermo Jul 27 '19 at 12:18
  • @HsinhsinHung In the meantime, to address your problem, the code I linked to on my github [in my previous comment](https://stackoverflow.com/questions/57173623/how-do-i-parse-the-html-for-corresponding-to-the-xml-standard?noredirect=1#comment100965395_57173623) will take an HTML document and fix the JavaScript, though I'm still working on converting ` – Rafa Guillermo Jul 27 '19 at 12:26
  • I thought that your goal might be different from my proposal. So I deeply apologize my proposal was not useful for your situation. – Tanaike Jul 27 '19 at 23:20

1 Answer


You need to make sure that the string you are passing to XmlService.parse() is valid XML, containing neither malformed tags nor unescaped special characters.

Since the Xml.parse() method of Apps Script was deprecated, the old leniency parameter that could optionally be set is no longer available; it is not part of XmlService.parse().

XmlService.parse() is an XML parser, not an HTML parser. While the two document types have similar base structures, there are a few differences which cause XmlService.parse() to throw an error.

The first problem is that XML documents cannot contain unclosed tags. Every HTML document starts with a <!DOCTYPE html> declaration, which XmlService.parse() reads as an opening XML tag; because HTML never closes it, the parser reports a malformed structure. <meta> tags in an HTML document cause the same problem because they are also non-closing, and in fact any HTML tag written without a closing counterpart will make XmlService.parse() throw an error. User Tanaike has a really powerful workflow to rectify this, which you can find here.
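The unclosed-tag problem could be approached with a pre-processing step before calling XmlService.parse(). The following is only a rough sketch, not a complete HTML-to-XML converter: `closeVoidTags` and `VOID_TAGS` are hypothetical names introduced here, the tag list is an assumption, and the regex will break on attribute values that contain `>`.

```javascript
// Hypothetical helper (not part of Apps Script): rewrites non-closing HTML
// tags such as <meta> and <br> as self-closing XML tags.
// The list below is an assumption and may need extending for a given page.
var VOID_TAGS = ["area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"];

function closeVoidTags(html) {
  // Drop the DOCTYPE declaration so that it is not handed to the XML parser.
  var withoutDoctype = html.replace(/<!DOCTYPE[^>]*>/i, "");

  // Match <tag ...> or <tag .../> for every tag name in VOID_TAGS.
  var pattern = new RegExp(
    "<(" + VOID_TAGS.join("|") + ")\\b([^>]*?)\\s*/?>", "gi");

  // Re-emit each match in self-closing form, keeping its attributes.
  return withoutDoctype.replace(pattern, function (match, name, attrs) {
    return "<" + name + attrs + " />";
  });
}
```

In Apps Script, the result could then be passed on, for example `XmlService.parse(closeVoidTags(html))`, although other HTML quirks (unquoted attributes, stray `&` characters) may still make the parser throw.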

The second problem is that the document you are trying to fetch contains embedded JavaScript within <script></script> tags. XML has 5 special characters - &, ", ', <, and >.

All five of these characters are used as operators or string delimiters in JavaScript. Unless they have been escaped into their XML-safe forms ('&amp;', '&quot;', '&apos;', '&lt;', and '&gt;' respectively), the parser reads them as XML markup, which is why it throws an entity reference error. In your example page, The reference to entity "l" must end with the ';' delimiter. is thrown because of an unescaped &l in the code.
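One way to deal with the embedded JavaScript is to escape the five special characters inside each `<script>` block before parsing. This is a minimal sketch under assumptions: `escapeXml` and `escapeScriptBodies` are hypothetical helper names, and the regex assumes that the text `</script>` does not appear inside a script string.

```javascript
// Hypothetical helper: replaces the five XML special characters with their
// entity forms. '&' must be handled first, otherwise the '&' of the entities
// added by the later replacements would be escaped a second time.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: escapes only the text between <script ...> and
// </script>, leaving the surrounding markup untouched.
function escapeScriptBodies(html) {
  var pattern = /(<script\b[^>]*>)([\s\S]*?)(<\/script>)/gi;
  return html.replace(pattern, function (match, open, body, close) {
    return open + escapeXml(body) + close;
  });
}
```

The escaped scripts are only meant to get the document through the parser; if the scripts themselves are not needed, removing the `<script>` blocks entirely may be simpler.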

It seems the XmlService.parse() method is working as intended, as it expects XML as a string, not HTML. There is, however, a bug on Google's Issue Tracker that describes this: now that Xml has been deprecated, there is no longer an Apps Script feature that does HTML-to-XML parsing. If you star the bug on Issue Tracker (top right), you can let Google know you are also having issues with this and get updates on their responses.

Rafa Guillermo