
Here's the content of a Google Document:

Some text, more text...

<li>
some lines
more lines...
</li>

And more text

I would like a regex to match:

<li>
...
</li>

So far it returns null. My regex finds <li>...</li> on a single line, but not when the content contains new lines, even though I am using the (?s) flag, which was suggested to make . match any character including new lines:

(?s)<li>(.)*?</li>

My regex works in https://regexr.com/ and https://regex101.com/, so I don't understand why it doesn't work in Google Apps Script.

Greg Forel
  • Regex is badly suited for tasks like this. You should use javascript's DOMParser. – CAustin Dec 10 '19 at 00:55
  • Yes I know you're right. I thought i could only use the DOMParser in a valid html file, as the content of my Google document will be a mix of a few snippets of code and normal text? – Greg Forel Dec 10 '19 at 10:55
  • It doesn't need specifically be valid HTML, just well formed XML. So as long as there's a single root element and the nested contents follow proper XML syntax, it should be fine. – CAustin Dec 11 '19 at 02:21

1 Answer

  • You want to retrieve the text of <li ...>....</li> in a Google Document.
  • You want to achieve this using Google Apps Script.

If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.

Issue and workaround:

If you want to use the pattern <li sheet="[a-zA-Z0-9]*">[\s\S]*?<\/li>, note that findText takes the pattern as a string, so the backslashes must be doubled: <li sheet="[a-zA-Z0-9]*">[\\s\\S]*?<\/li>. Even so, from your sample value it looks like <li ...>....</li> spans several paragraphs. In that case, body.findText(searchPattern) with const searchPattern = '<li sheet="[a-zA-Z0-9]*">[\\s\\S]*?<\/li>' returns null. If <li ...>....</li> is put in a single paragraph, body.findText(searchPattern) returns <li ...>....</li>.
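As a small illustration of the escaping point, here is a sketch in plain JavaScript. It uses the RegExp constructor only as a stand-in for any API that receives a pattern as a string; it is not a statement about how findText evaluates patterns internally. The variable names and the sample string are assumptions made for the example.

```javascript
// A regex literal keeps backslashes exactly as written.
var literal = /<li sheet="[a-zA-Z0-9]*">[\s\S]*?<\/li>/;

// In a string literal, "\s" collapses to "s", so the class becomes [sS].
var wrong = '<li sheet="[a-zA-Z0-9]*">[\s\S]*?<\/li>';
// Doubling each backslash keeps "\s" and "\S" in the pattern text.
var right = '<li sheet="[a-zA-Z0-9]*">[\\s\\S]*?<\/li>';

// Hypothetical sample text containing new lines.
var sample = '<li sheet="other">\n{{test}}\n</li>';

var literalMatches = literal.test(sample);
var wrongMatches = new RegExp(wrong).test(sample);
var rightMatches = new RegExp(right).test(sample);
```

Logging `wrong` shows that the backslashes are already gone before the pattern reaches the regex engine, which is why the single-backslash string cannot match across lines.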

To find a <li ...>....</li> block that spans several paragraphs, how about the following workaround? The flow is as follows.

Flow:

  1. Use <li sheet= and <\/li> as the search patterns.
  2. Using the pattern <li sheet=, retrieve the paragraph where <li ...> begins.
  3. Using the pattern <\/li>, retrieve the paragraph where </li> ends.
  4. Retrieve the texts from the begin paragraph to the end paragraph.
  5. Repeat this cycle until all <li ...>....</li> values have been found.
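To show the flow without the Document API, here is a minimal sketch in plain JavaScript that models the document as an array of paragraph strings. The function findBlocks and the sample array are hypothetical and exist only for this illustration; the property names follow the ones used in this answer.

```javascript
// Hypothetical helper: `paragraphs` is a plain array of strings standing in
// for the document's paragraphs. It is not part of the Apps Script API.
function findBlocks(paragraphs, startTag, endTag) {
  var res = [];
  var i = 0;
  while (i < paragraphs.length) {
    // Steps 1-2: find the paragraph containing the opening tag.
    if (paragraphs[i].indexOf(startTag) === -1) {
      i++;
      continue;
    }
    // Step 3: find the paragraph containing the closing tag.
    var j = i;
    while (j < paragraphs.length && paragraphs[j].indexOf(endTag) === -1) {
      j++;
    }
    if (j === paragraphs.length) break; // Unclosed block: stop searching.
    // Step 4: join the paragraphs from the begin to the end paragraph.
    res.push({
      startIndex: i,
      endIndex: j + 1,
      texts: paragraphs.slice(i, j + 1).join("")
    });
    // Step 5: continue searching after the closing paragraph.
    i = j + 1;
  }
  return res;
}

var paragraphs = [
  "Some text, more text...",
  "<li sheet=\"other\">",
  "{{test}}",
  "</li>",
  "And more text"
];
var blocks = findBlocks(paragraphs, "<li sheet=", "</li>");
```

Because each paragraph is searched on its own, no pattern ever has to match across a paragraph boundary.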

Sample script:

function parseLists(body) {
  // To run this standalone, get the body from the active document:
  // var doc = DocumentApp.getActiveDocument();
  // var body = doc.getBody();

  var pattern1 = "<li sheet=";
  var pattern2 = "<\/li>";
  var range1 = body.findText(pattern1);
  var res = [];
  while (range1) {
    // Search for the closing tag, starting from the opening tag.
    var range2 = body.findText(pattern2, range1);
    if (!range2) break; // Unclosed <li ...>: stop instead of throwing on null.
    var temp = {};
    // findText returns a Text element; its parent is the paragraph.
    var p1 = range1.getElement().getParent();
    var p2 = range2.getElement().getParent();
    temp.startIndex = body.getChildIndex(p1);
    temp.endIndex = body.getChildIndex(p2) + 1;
    var texts = "";
//      for (var i = temp.startIndex + 1; i < temp.endIndex - 1; i++) {
    for (var i = temp.startIndex; i < temp.endIndex; i++) {
      texts += body.getChild(i).asParagraph().getText();
    }
    temp.texts = texts;
    res.push(temp);
    // Continue with the next <li ...> after this closing tag.
    range1 = body.findText(pattern1, range2);
  }
  Logger.log(res);
}

Result:

When your sample values are put in a new Google Document and the script is run, the following result is retrieved.

[
  {
    "startIndex": 0,
    "endIndex": 5,
    "texts": "<li sheet=\"experiences\">{{company_name}},  {{job_location}} — {{job_title}}MONTH {{from}} - {{to}}{{description}}</li>"
  },
  {
    "startIndex": 6,
    "endIndex": 9,
    "texts": "<li sheet=\"other\">{{test}}</li>"
  }
]
  • For the above result, if you want to retrieve the values {{company_name}}, {{job_location}} — {{job_title}}MONTH {{from}} - {{to}}{{description}} and {{test}} without the tags, please modify the above script as follows.

    • From:

      for (var i = temp.startIndex; i < temp.endIndex; i++) {
      
    • To:

      for (var i = temp.startIndex + 1; i < temp.endIndex - 1; i++) {
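The modified loop drops the first and last paragraphs, so it assumes the opening and closing tags each sit in a paragraph of their own. As an alternative sketch, if the concatenated texts value is already available, the tags can also be stripped with a plain JavaScript regular expression. The helper stripLiTags is hypothetical and not part of the script above.

```javascript
// Hypothetical helper: returns the content between <li ...> and </li>,
// or the input unchanged when the string is not wrapped in those tags.
function stripLiTags(texts) {
  // [\s\S]*? matches any character, including new lines.
  var m = texts.match(/^<li\b[^>]*>([\s\S]*?)<\/li>$/);
  return m ? m[1] : texts;
}

var inner = stripLiTags('<li sheet="other">{{test}}</li>');
var multiline = stripLiTags('<li sheet="other">\n{{test}}\n</li>');
var untouched = stripLiTags('And more text');
```

This works on an ordinary string, so it can also handle a tag and its content that share one paragraph.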
      

If I misunderstood your question and this was not the direction you want, I apologize.

Tanaike
  • Hi Tanaike, thank you very much for your answer. I will certainly use part of it, but I simplified the description of my issue to focus on 1 problem only: matching `<li>...</li>` when `...` has any character, including new lines. The regex you suggested is considered invalid by Google App Scripts. Any idea? – Greg Forel Dec 10 '19 at 14:08
  • I might be out of luck: https://stackoverflow.com/questions/37771381/eliminate-newlines-in-google-app-script-using-regex – Greg Forel Dec 10 '19 at 14:41
  • @Greg Forel Thank you for replying. I have to apologize for my poor English skill. Unfortunately, I cannot understand about your current issue of `The regex you suggested is considered invalid by Google App Scripts.`. Because in my test, your sample values can be used with the sample script. So can you explain about the detail information of it? I would like to think of it. – Tanaike Dec 10 '19 at 22:32
  • sorry, your regex does work. I made a mistake using it. This is helpful to solve my problem, thanks again! – Greg Forel Dec 11 '19 at 13:16