2

I need to parse table data to my spreadsheet, there is no error in log, but cells in spreadsheet are blank. The problem is that I can't use built-in importhtml function, becouse the date-related data in the tag is enered dynamically on the website. I tryied to getchild getchildren but it doesn't work.

The structure of html site looks like this:

   <html>
      <head>
        <title>TITLE AAAAA</title>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <meta content="IE=EmulateIE7" http-equiv="X-UA-Compatible">
        <style type="text/css">
          body {
            font-size: 12px;
            font-family: Arial
          }
          td {
            font-size: 12px;
            line-height: 20px;
            font-family: Arial
          }
        </style>
        <script type="text/javascript" language="javascript" src="Function.js"></script>
      </head>
      <body>
        <p align="center">
          <b>AAAA: &nbsp; AAAAAA</b>
        </p>
        <table width="300" border="0" align="center" cellpadding="1" cellspacing="1" bgcolor="#0066cc">
          <tbody>
            <tr align="center" bgcolor="#333399" class="font13">
              <td width="150">
                <b>
                  <font color="#ffffff">TO_CELL_A1_TEXT</font>
                </b>
              </td>
              <td width="150">
                <b>
                  <font color="#ffffff">TO_CELL_B1_TEXT</font>
                </b>
              </td>
              <td width="150">
                <b>
                  <font color="#ffffff">TO_CELL_C1_TEXT</font>
                </b>
              </td>
              <td width="150">
                <b>
                  <font color="#ffffff">TO_CELL_D1_TIME_TEXT</font>
                </b>
              </td>
            </tr>
            <tr align="center" bgcolor="#FFFFFF">
              <td height="20">
                <b>
                  <font color="red">TO_CELL_A2_TEXT</font>
                </b>
              </td>
              <td>
                <b>
                  <font color="red">TO_CELL_B2_TEXT</font>
                </b>
              </td>
              <td>
                <b>
                  <font color="red">TO_CELL_C2_TEXT</font>
                </b>
              </td>
              <td>
                <script>
                  showtime(2023, 01 - 1, 13, 23, 01, 12)
                </script>"TO_CELL_D2_TIME_TEXT"
              </td>
            </tr>
          </tbody>
        </table>
        <br>
        <p align="center">SITE_NAME</p>
      </body>

    </html>

Is there any solution to this problem?

James2023
  • 21
  • 5
  • Hi, thank you for you answer. As u may see in source code of website there is – James2023 Jan 14 '23 at 15:13
  • 2
    Understood. See the links at the end of [How to know if Google Sheets IMPORTDATA, IMPORTFEED, IMPORTHTML or IMPORTXML functions are able to get data from a resource hosted on a website](https://webapps.stackexchange.com/a/115665/269219) for some pointers. – doubleunary Jan 14 '23 at 15:45
  • Welcome to [so]. The question isn't a good fit for this site. First you should remove the "chit-chat" (all the stuff about yourself should not be included on questions or answers, if it's important to you to share that, you could include it on your profile). When asking for help on fixing a code, the question should include a [mcve], means that besides the page structre and the code, you should include the textual error messages and any other relevant details necessary for others to reproduce the error. – Rubén Jan 14 '23 at 17:48
  • Also, you should include a brief description of your search efforts to find helpful content from this site explaining why the most relevant posts where not helpful to you. – Rubén Jan 14 '23 at 17:49
  • Due to the question title and the code, it looks that you used very old sources. Search this site for questions about usting google apps script for web scraping. This might give you some hints: https://stackoverflow.com/q/69350317/1595451 – Rubén Jan 14 '23 at 17:56
  • Thank you for your advice @Rubén. I improved my code a bit, removing unnecessary lines and using other class elements. At the moment, the execution log does not show any errors, but the values are not entered into the spreadsheet. Please take a look at the thread if you have time, I posted the code there after the fix. I am also learning from /doubleunary links. – James2023 Jan 14 '23 at 18:20
  • I added it as an answer. Thanks a lot for your respond. I will read the content and correct the code. I just thought that xmlservice.parse would handle such a simple case, I also searched the forum before, but I did not find the threads you indicated. – James2023 Jan 14 '23 at 18:32
  • James: The workings of this are very different from other sites. Actualy many users dislike a lot the use of terms like "forum" and "threads"... Please [edit] this question to make it a good fit for this site. As there are no answers yet, there is no need to keep the old code. – Rubén Jan 14 '23 at 18:34
  • 1
    Fixed that, deleted answer and edited first post. – James2023 Jan 14 '23 at 18:41
  • On this site titles should not contain tags; should not be used to indicate things like "solved" and questions should not include the solution, instead post the solution as an answer. You could mark it as accepted. – Rubén Jan 21 '23 at 16:25
  • I'm really sorry Ruben for giving you work on this "question", I really didn't do it intentionally. Looks like you fixed my question regarding title and tags. Should I cut the solution out of the question and add it as an answer below anyway? Where (and what) can I mark as accepted? – James2023 Jan 21 '23 at 17:05
  • Well fixed that anyway. Sorry. – James2023 Jan 21 '23 at 17:23
  • No problem. Thank you to coming back to post your soluction as answer. – Rubén Jan 21 '23 at 18:25

2 Answers2

1

The code is not importing anything because it's looking for <tr> as children of <table> but, assuming that the HTML included in the question is the response given by the server, they are children of <tbody>. One simple fix is to replace

var rows = table.getChildren("tr");

by

var tbody = table.getChild("tbody");
var rows = tbody.getChildren("tr");

There are more elegant solutions like using specialised libraries on parsing HTML/XML documents but before entering on using them you should spend some time learning the basics of Google Apps Script and Document Object Model (DOM).

Related

Rubén
  • 34,714
  • 9
  • 70
  • 166
  • Now i recived error: 20:14:18 Error TypeError: Cannot read properties of null (reading 'getChildren') importParsedData @ Code.gs:16 at line: var rows = tbody.getChildren("tr"); after changing this line to: var rows = tbody.getChildren("rows"); I get same error at same line. – James2023 Jan 14 '23 at 19:17
  • 1
    @James2023 Since you didn't provided the real URL I used the HTML provided in the answer as the initial value of the variable `html`. It's very likely that the HTML that you provided in the question is not the same that is being delivered by the server. I suggest you to post a new question either to include the real url or to provide the response from the server. Sinse the response from the server might be too long for the execution logs you should have to save it as a file. If you don't know how to do that, post a new question. – Rubén Jan 14 '23 at 19:22
  • I also tryied to get elements by class name before i posted my question, but it seems it doesn't work in apps script anymore. – James2023 Jan 14 '23 at 19:25
  • 1
    I don't know what you mean by "get elements by class name" as that is not, and never was, a built-in feature in Google Apps Script. – Rubén Jan 14 '23 at 19:27
  • As you mentioned, it seems that I need to spend some time learning the basics and using external libraries. Thanks a lot, I will try tomorrow after spending some time studying. – James2023 Jan 14 '23 at 19:33
  • 1
    Before investing more time on this... first checkout [Scraping data to Google Sheets from a website that uses JavaScript](https://stackoverflow.com/q/74237688/1595451) – Rubén Jan 14 '23 at 19:33
  • It seems that I solved my problem :). – James2023 Jan 21 '23 at 08:49
0

Using the cheerio library, I solved the problem, also added the code that in the D2:D cells it replaces the values, for example: showtime(2023,01-1,20,21,09,48)" into the correct date and time format. I recommend reading the materials linked here by users @rubén and @doubleunary. If you run into a problem like me and you are a beginner, loading the Cheerio library via clasp into the cloned script on your computer will not work, because for const cheerio = require('cheerio');, require is a function available in the CJS module system that handles imports, basically. GitHub users came up against this problem and you should look for a solution to implement Cheerio in GAS, bearing in mind that these are not libraries recommended by Google, but created by users.

Here is working script in my case:

function importData() {
  var url = 'http://urlsite.com';
  var res = UrlFetchApp.fetch(url, {
    muteHttpExceptions: true
  }).getContentText();

  res = res.replace(/<script[^>]*>([\s\S]*?)<\/script>/gi, function(match, capture) {
    return capture;
  });
  const $ = Cheerio.load(res);

  var col1 = $('table tr td:nth-child(1)').toArray().map(x => $(x).text());
  var col2 = $('table tr td:nth-child(2)').toArray().map(x => $(x).text());
  var col3 = $('table tr td:nth-child(3)').toArray().map(x => $(x).text());
  var dateValues = $('table tr td:nth-child(4)').toArray().map(x => $(x).text());
  var table = dateValues.map((d, i) => [col1[i], col2[i], col3[i], d]);
  var range = SpreadsheetApp.getActiveSheet().getRange(1, 1, table.length, table[0].length);
  range.setValues(table);

  var sheet = SpreadsheetApp.getActiveSheet();
  var range = sheet.getRange("D2:D");
  var values = range.getValues();
  var convertedDates = values.map(function(value) {
    var match = value[0].match(/showtime\((\d+),(\d+)-1,(\d+),(\d+),(\d+),(\d+)\)/);
    if (match) {
      var year = match[1];
      var month = match[2];
      var day = match[3];
      var hour = match[4];
      var minute = match[5];
      var second = match[6];
      var date = new Date(year, month - 1, day, hour, minute, second);
      return Utilities.formatDate(date, "GMT", "yyyy-MM-dd HH:mm");
    }
    return "";
  });
  range.setValues(convertedDates.map(function(x) {
    return [x];
  }));
}

Moreover, while studying, I wrote a script to check ability to fetch data from website, what data is fetched and how does the tree of html looks like. You can use it in doc.new script.

function addMenuTab() {
  var document = DocumentApp.getActiveDocument();
  var menu = DocumentApp.getUi().createMenu("MY PROGRAMS")
    .addItem("1_CLEAN DOC", "menuItem1Function")
    .addItem("2_GENERATE", "menuItem2Function")
    .addItem("3_BOLD RESPONSES", "menuItem3Function")
    .addToUi();
}

function menuItem1Function() {
  DocumentApp.getActiveDocument().getBody().clear();
}

function menuItem2Function() {
  var ui = DocumentApp.getUi();
  var result = ui.prompt(
    'Check ability to parse website!',
    'Please enter your url:',
    ui.ButtonSet.OK_CANCEL);
  var button = result.getSelectedButton();
  var url = result.getResponseText();
  if (button == ui.Button.OK) {
    ui.alert('Your URL is ' + url + '.');
  } else if (button == ui.Button.CANCEL) {
    ui.alert('I didn\'t get your URL.');
  } else if (button == ui.Button.CLOSE) {
    ui.alert('You closed the dialog box.');
  }

  var body = DocumentApp.getActiveDocument().getBody();
  body.clear();
  var response = UrlFetchApp.fetch(url);
  var responseOpts1 = UrlFetchApp.fetch(url, opts1);
  /**
  var responseOpts2 = UrlFetchApp.fetch(url, opts2);
  */
  var opts1 = {
    // methods: get, delete, patch, post, put - by default: get
    'method': 'get',

    // contentType: 'application/json' ;  'application/xml' ; 'application/html' - by default: application/x-www-form-urlencoded&#39
    'contentType': 'text/javascript',

    // key/value map of the HTTP headers for the request;
    'headers': '1',
  }

  /**
  var Opts2 = {
  //methods: get, delete, patch, post, put - by default: get
  'method' : 'get',
  'payload' : 'formdata'
  }
  */

  var responseCode = response.getResponseCode();
  Logger.log("Response Code: " + responseCode);

  var headers = response.getHeaders();
  Logger.log("Headers: " + headers);

  var headersText = JSON.stringify(response.getHeaders());
  Logger.log("HeadersText: " + headersText);

  var contentLength = response.getContentText().length;
  Logger.log("Content Length: " + contentLength);

  var allHeaders = response.getAllHeaders();
  Logger.log("All Headers: " + allHeaders);

  var allHeadersText = JSON.stringify(response.getAllHeaders());
  Logger.log("All HeadersText: " + allHeadersText);

  var document = DocumentApp.getActiveDocument();
  var body = document.getBody();

  var contentText = response.getContentText();
  Logger.log("Content Text: " + contentText);

  var contentText2 = responseOpts1.getContentText();
  Logger.log("Content Text & Opts1: " + contentText2);

  /**
  var contentText3 = responseOpts1.getContentText();
  // Replace html elements
  contentText3 = contentText3.replace(/(<(?=link|meta)[^>]*)(?<!\/)>/ig, '$1/>');
  contentText3 = contentText3.replace(/&(?!amp;)/ig, '&amp;');
  contentText3 = contentText3.replace(/ /ig, " ");
  contentText3 = contentText3.replace(/<table[^>]*>/ig, "<table>");
  contentText3 = contentText3.replace(/<tr[^>]*>/ig, "<tr>");
  contentText3 = contentText3.replace(/width[^>]*>/ig, "<width>");
  contentText3 = contentText3.replace(/<td[^>]*>/ig, "<td>");
  contentText3 = contentText3.replace(/<font[^>]*>/ig, "<font>");
  contentText3 = contentText3.replace(/<p[^>]*>/ig, "<p>");
  contentText3 = contentText3.replace(/width[^>]*>/ig, "<width>");
  contentText3 = contentText3.replace(/<br>|<\/br>/ig, "");
  contentText3 = contentText3.replace(/<style[^>]*>/ig, "<style>");
  contentText3 = contentText3.replace(/(<(p|script|style)[^>]*>)/ig, '$1<![CDATA[').replace(/(<\/(p|script|style)[^>]*>)/ig, ']]>$1')
  Logger.log("Content Text & Opts: " + contentText3);

  var contentText4 = XmlService.parse("<!DOCTYPE html>" + contentText3);
  var root = contentText4.getRootElement().getChild("body");
  var table = root.getChild("table");
  var tbody = table.getChild("tbody");
  var tr = table.getChild("tr");
  var td = tr.getChild("td"); 
  var font = td.getChildren("font"); 
  var script = td.getChildren("script"); 
  */


  body.appendParagraph("Response Code: " + responseCode);
  body.appendParagraph("");
  body.appendParagraph("Headers: " + headers);
  body.appendParagraph("");
  body.appendParagraph("HeadersText: " + headersText);
  body.appendParagraph("");
  body.appendParagraph("Content Length: " + contentLength);
  body.appendParagraph("");
  body.appendParagraph("All Headers: " + allHeaders);
  body.appendParagraph("");
  body.appendParagraph("All HeadersText: " + allHeadersText);
  body.appendParagraph("");
  body.appendParagraph("Content Text: " + contentText);
  body.appendParagraph("");
  /** 
  body.appendParagraph("Content Text & Opts1: " + contentText2);
  body.appendParagraph("");
  body.appendParagraph("Content Text3: " + contentText3);
  body.appendParagraph("");
  body.appendParagraph("Content Text4: " + contentText4);
  body.appendParagraph("");
  body.appendParagraph("Root: " + root);
  body.appendParagraph("");
  body.appendParagraph("Table: " + table);
  body.appendParagraph("");
  body.appendParagraph("Tbody: " + tbody);
  body.appendParagraph("");
  body.appendParagraph("Tr: " + tr);
  body.appendParagraph("");
  body.appendParagraph("Td: " + td);
  body.appendParagraph("");
  body.appendParagraph("font: " + font);
  body.appendParagraph("");
  body.appendParagraph("script: " + script);
  body.appendParagraph("");
  */
}

function menuItem3Function() {
  var doc = DocumentApp.getActiveDocument();
  var text = doc.getBody().getText();
  var searchString1 = "Response Code: ";
  var searchIndex = text.indexOf(searchString1);
  var responseCode = text.substring(searchIndex + searchString1.length, searchIndex + searchString1.length + 3);
  var textSearched = doc.getBody().findText(searchString1 + responseCode);
  var element1 = textSearched.getElement();
  if (responseCode === "200") {
    element1.setBold(true);
    element1.setForegroundColor("#9cd161");
  } else {
    element1.setBold(true);
    element1.setForegroundColor("#ff0000");
  }

  var body = doc.getBody();
  var searchString = ["Headers", "HeadersText", "Content Length", "All Headers", "All HeadersText", "Content Text", "Content Text & Opts:", "Content Text3:"];
  for (var i = 0; i < searchString.length; i++) {
    var searchResult = body.findText(searchString[i]);
    while (searchResult !== null) {
      var searchResultElement = searchResult.getElement();
      var start = searchResult.getStartOffset();
      var end = searchResult.getEndOffsetInclusive();
      searchResultElement.setBold(start, end, true);
      searchResult = body.findText(searchString[i], searchResult);
    }
  }
}
James2023
  • 21
  • 5