
I hit a snag. This code snippet works great to get film ratings when they exist, but it errors out when it reaches a record that doesn't match the regex: "TypeError: Cannot read property '0' of null"

  const ratingPass1 = /<span class="rating rated-([\s\S]*?)">/g;
  const ratingPass2 = /(?<=<span class="rating rated-).*?(?=\">)/g;

for (var i = 0; i < 18; i++) {
  var rating1String = results[i].match(ratingPass1);
  Logger.log('content: ' + rating1String[0]);
  var rating2String = rating1String[0].match(ratingPass2);
  Logger.log('content: ' + rating2String[0]); // --> error is here
}

I'm too new to JavaScript to know how to implement an 'includes' or 'contains' or something of that ilk in this code. But I'm getting reasonably good with regex, and figured I might be able to turn the regex into one large non-capturing group with the capturing group inside it, so I tried:

const ratingPass1 = /(?:<span class="rating rated-([\s\S]*?)">)/g;
var rating1String = results[i].match(ratingPass1);
    Logger.log('content: ' + rating1String[0]);

but I keep getting the error, and I guess I should, because I'm still saying "find it, but exclude it", where I need "if you don't find it, just ignore it". Maybe it's the "match" in

var rating1String = results[i].match(ratingPass1);
    Logger.log('content: ' + rating1String[0]);

that could be changed to say something like match OR ignore if null?


Update: It took quite a few hours, but I figured something out. Might just work by some fluke, but at least it works!

I replaced the variables and logging info with the following:

  var rating0String = "";
  var rating1String = results[i].match(ratingPass1);
  // match() returns null when nothing matches, so test the result, not the regex
  if (!rating1String) {
    Logger.log('content: ' + rating0String);
  } else {
    Logger.log('content: ' + rating1String);
  }
  var rating2String = results[i].match(ratingPass2);
  if (!rating2String) {
    Logger.log('content: ' + rating0String);
  } else {
    Logger.log('content: ' + rating2String);
  }
lise
  • How about including something like `if (!rating1String) continue`? This would jump to the next iteration if the condition is not met. Remember that `String.prototype.match()` [returns](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match#return_value) `null` if no matches are found. – Emel Feb 22 '22 at 08:41
  • [Parsing HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) is considered bad practice. Have you considered using the library https://github.com/tani/cheeriogs ? – Kos Feb 22 '22 at 09:00
  • @Emel This works but I need it to iterate through 18 records. Using `if (!rating1String) continue` skips over Roma in this example and just displays 2 of 3 records. {Die Hard; Jim; 5} {Roma; Sue; } {Pig; Nathan; 4.5} – lise Feb 22 '22 at 20:23

2 Answers

Using two regular expressions that match the same text twice makes little sense, especially since your first regex already contains a capturing group around the pattern part you want to extract. Just use the index of the capture on the match object.

You can use:

const ratingPass = /<span class="rating rated-([\s\S]*?)">/g;
for (const result of results) {
  const matches = result.matchAll(ratingPass);
  for (const match of matches) {
    Logger.log('rating1String: ' + match[0]);
    Logger.log('rating2String: ' + match[1]);
  }
}

Here,

  • <span class="rating rated-([\s\S]*?)"> matches <span class="rating rated-, then captures any zero or more chars but as few as possible into Group 1 (with ([\s\S]*?)) and then matches ">
  • for (const result of results) {...} iterates over some results array
  • const matches = result.matchAll(ratingPass) gets all matches per result string
  • for (const match of matches) {...} iterates over the matches found
  • match[0] is the whole match value, match[1] is the part captured into Group 1.
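To keep one entry per record even when a record has no rating, the captures can be collected per result string. This is only a minimal sketch: `sampleResults`, `ratingPattern` and `extractRatings` are hypothetical names, the sample strings are made up, and values are returned instead of being passed to `Logger.log` so the snippet also runs outside Apps Script.

```javascript
// Hypothetical sample data standing in for the real `results` array.
const sampleResults = [
  '<li><span class="rating rated-10"></span></li>',
  '<li>no rating in this record</li>',
  '<li><span class="rating rated-9"></span></li>',
];

// Same pattern as above; matchAll() requires the /g flag.
const ratingPattern = /<span class="rating rated-([\s\S]*?)">/g;

// Return every Group 1 capture found in one string.
// A string without any match yields an empty array instead of null.
function extractRatings(text) {
  return Array.from(text.matchAll(ratingPattern), (match) => match[1]);
}

// One array of ratings per record, in the original order.
const ratingsPerResult = sampleResults.map(extractRatings);
```

Because `matchAll()` gives back an iterator rather than `null`, a record without a rating simply contributes an empty array and nothing throws.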

Update after you shared the script

function DiaryImportMain() {
  DiaryImportclearRecords();
  const url = "https://letterboxd.com/tag/30-countries-2021/diary/";
  const str = UrlFetchApp.fetch(url).getContentText();
  const mainRegex = /<li class="poster-container">([\s\S]*?)<\/li>/gi;
  const results = str.match(mainRegex) || []; // match() returns null when nothing matches
  const filmTitlePass = /height="225" alt="([\s\S]*?)"\/>/i;
  const usernamePass = /<strong class="name"><a href="\/(?:[\s\S]*?)\/">([\s\S]*?)<\/a><\/strong>/i;
  const ratingPass = /<span class="rating rated-([\s\S]*?)">/i;

  for (var i = 0; i < results.length; i++) {
    Logger.log('content: ' + results[i]);
    const filmTitle = (results[i].match(filmTitlePass) || ['','']);
    const filmTitle1String = filmTitle[0]; 
    Logger.log('content: ' + filmTitle1String);
    const filmTitle2String = filmTitle[1];
    Logger.log('content: ' + filmTitle2String);
    const username = (results[i].match(usernamePass) || ['','']);
    const username1String = username[0];
    Logger.log('content: ' + username1String);
    const username2String = username[1];
    Logger.log('content: ' + username2String);
    const rating = (results[i].match(ratingPass) || ['','']);
    const rating1String = rating[0];
    Logger.log('content: ' + rating1String);
    const rating2String = rating[1];
    Logger.log('content: ' + rating2String);

    DiaryImportaddRecord(i+1, filmTitle2String, username2String, rating2String);
  }
}
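The repeated `match(...) || ['','']` fallback in the script could be wrapped in a small helper. The following is a sketch under assumptions: `firstCapture` and `ratingOnly` are hypothetical names that do not appear in the original script, and the input strings are invented for illustration.

```javascript
// Return the Group 1 capture of the first match, or '' when there is no match.
// The regex must NOT carry the /g flag, otherwise match() drops the capture groups.
function firstCapture(text, regex) {
  const match = text.match(regex) || ['', ''];
  return match[1];
}

// Same rating pattern as in the script above (no /g flag).
const ratingOnly = /<span class="rating rated-([\s\S]*?)">/i;

const rated = firstCapture('<span class="rating rated-8">', ratingOnly);
const unrated = firstCapture('<li>record without a rating</li>', ratingOnly);
```

With a helper like this, a record that lacks a rating yields an empty string, so the corresponding table cell stays blank instead of the loop throwing or skipping the row.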
Wiktor Stribiżew
  • Wiktor, this works perfectly but for one thing, and it was my mistake for not making it clear in my request (I was trying to minimize the amount of code in my posting): I need the results to iterate over the source code because I'm getting other info as well. So I need "filmtitle" "username" "rating" for an 18 row table. I'm not sure how to use your code with a for(var i=0; i<18; i++. If you have the time or the inclination I've put my script in this shared gsheet https://docs.google.com/spreadsheets/d/1cDw0gJsOqWBazEakacAGtrjisqQJtAbERNv8y9Ifvpk/edit?usp=sharing – lise Feb 22 '22 at 18:37
  • @lise you do not need to hardcode 18. `for (const match of matches) {` iterates over all your `results`. Also, you should not double regexps, just use a capturing group. – Wiktor Stribiżew Feb 22 '22 at 20:27
  • @lise I revamped the functions. Mind that you need to save the `match(regex)` result into a variable and, if the match fails, assign it an array with two empty strings (one for the whole match, the other for the capture). Then, the whole match is in `match_result[0]` and the specific value is in `match_result[1]`. – Wiktor Stribiżew Feb 22 '22 at 20:41
  • Wiktor, thanks! I really appreciate the time you took to look at this. You must have posted just as I was having my own 'ah ha' moment. Not sure why the code in my (newly edited post) works, but it does! Having said that, I much prefer your way so I'm going to give it a go. (Still-quite proud of myself for having figured something out though!) – lise Feb 22 '22 at 21:17
  • Hmmm. Seems that all my records are undefined: 4:46:57 PM Info content: height="225" alt="The Wandering Earth"/> 4:46:57 PM Info content: undefined 4:46:57 PM Info content: livpope 4:46:57 PM Info content: undefined 4:46:57 PM Info content: 4:46:57 PM Info content: undefined Perhaps I misunderstood your previous comment - am I supposed to add those [match_result[0] and match_result[1]? – lise Feb 22 '22 at 21:53
  • @lise It is due to `match()` and the `/g` flag. Since you need to get the first match in all cases, let's remove the `/g` flag in `filmTitlePass`, `usernamePass` and `ratingPass`. – Wiktor Stribiżew Feb 22 '22 at 21:57
  • In the 2nd comment you said: "you do not need to hardcode 18. `for (const match of matches) {` iterates over all your `results`". I've tried multiple ways to get rid of the middle condition of the for loop (after googling and trying quite a few things) but nothing works. If you have a minute, can you explain your suggestion? – lise Feb 23 '22 at 21:59
  • Ah, too late to edit previous comment. I finally found something that works: `for(var i = 0; i < results.length; i++) {` – lise Feb 23 '22 at 22:10
  • `for (const result of results)` works, too. – Wiktor Stribiżew Feb 23 '22 at 22:12
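
The effect of the `/g` flag discussed in the comments above can be checked with a short sketch. The `sampleHtml` string and the variable names are made up for illustration.

```javascript
// Hypothetical one-line sample; the real input is the fetched page source.
const sampleHtml = '<span class="rating rated-8">';

// With /g, match() returns only the full matches and omits capture groups.
const withGlobal = sampleHtml.match(/<span class="rating rated-([\s\S]*?)">/g);

// Without /g, match() returns the first match plus its capture groups.
const withoutGlobal = sampleHtml.match(/<span class="rating rated-([\s\S]*?)">/);

const globalSecondItem = withGlobal[1];    // undefined: there is no second full match
const captureGroup = withoutGlobal[1];     // '8': the Group 1 capture
```

This is why index `[1]` came back as `undefined` while the patterns still carried `/g`.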

It can be done effectively using the Cheerio library; see the comments in the code:

function matchRating()
{
  // TODO replace html with your data
  const html = '<div><span class="rating rated-one"></span><span class="rating rated-two"></span><span class="rating rated-three"></span></div>';
  
  // create Cheerio object
  const $ = Cheerio.load(html);

  const ratingPrefixForClass = 'rated-';
  // select all spans with `rating` class
  $(".rating").each((i, el) => {
    let classAttr = $(el).attr('class');
    
    // split class attribute to get list of class names, find one with needed prefix
    let ratingClassSearch = classAttr.split(' ').find(cls => cls.indexOf(ratingPrefixForClass) === 0);

    // if needed class with prefix found, log its name, and its name without prefix
    if (ratingClassSearch)
    {
      console.log(ratingClassSearch, ratingClassSearch.substring(ratingPrefixForClass.length));
    }
  });
}

Main points:

  1. Do not use regex for parsing HTML.
  2. This uses the Cheerio JS library ported for Google Apps Script. To install it, you need to add it as a dependency.
Kos
  • Kos, thanks. It's a bit of an emergency for me as I need to get this going in the next few days but after that I'll see about re-doing it the 'proper' way. – lise Feb 22 '22 at 18:40