
I have a website that I need to scrape. The content is below (and the page numbers are variable):

    <a class="page-numbers" href="https://example.com/page/6/">6</a>
    <a class="page-numbers" href="https://example.com/page/7/">7</a>
    <a class="page-numbers" href="https://example.com/page/8/">8</a>
    <a class="next page-numbers" href="https://example.com/page/49/">NEXT</a>

I need to get the last page number, which in the above example is 8.

I'm using Apps Script with Google Sheets and I've tried various solutions, including grouping so it displays the full page numbers. My final output (based on the above example) should appear as: Total pages: 8

Could any of you REGEX wizards help?

Additional notes:

  • Using pure JS isn't an option
  • There can be any number of pages; what I'm looking for isn't always the third occurrence

The below is returning nothing.

    function regex_validshouldwork() {
      const url = 'https://example.com',
        response = UrlFetchApp.fetch(url);
      let content;
      let html = response.getContentText();

      const myRegex = new RegExp("(?:\<a class=\"page-numbers[^>]*>(\d+)<\/a>\s*)+");
      content = html.match(myRegex);

      SpreadsheetApp.getActiveSheet().getRange('a2').setValue(content);
    }
  • Thank you so much @bobblebubble. That seemed to do the job! You're a hero. – James Osborne May 23 '23 at 01:31
  • Glad it helped. If you use `new RegExp`, you need different escaping. [See the **updated JS demo**](https://tio.run/##pdAxb8IwEAXgPb/C8mSDwKpUQUpIOrF2QB2QMBWue01axY4VHzT99cEWEVLZoNsNp@89vW91VF63Xw4nx7TvdWM9kgpNTXKyXyqia@V9Tp0qYWIP5h1aT0nVwmdOK0TnF0JAp4yrYaobI@KfmAlazJZCFcmdwDwA8/8AaQDSK8BCh@Qm5fEpMC@rzWuU9lmSnNcxv2soocuJhR8SzlXnGGXPi0uU/FN2@1bsRgWT8mPMIySlH/Ex5VnkECyGpePgU6NQV2zQ@fZhNyQ2oVPdlGx451nfnwA): `const myRegex= new RegExp("(?:<a class=\"page-numbers[^>]*>(\\d+)<\\/a>\\s*)+");` – bobble bubble May 23 '23 at 08:35
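For anyone landing here later, a minimal sketch of the escaping point from the comment above, assuming the markup looks like the sample in the question. The names `lastPageNumber`, `viaConstructor`, `unescapedSource` and `sample` are hypothetical and not from the original post. In a string passed to `new RegExp`, `\d` and `\s` collapse to plain `d` and `s`, so the original pattern never matches and `match` returns `null`. A regex literal, or doubled backslashes in the string, avoids that.

```javascript
// Hypothetical helper: returns the last page number in the first run of
// consecutive numeric "page-numbers" links, or null if none is found.
function lastPageNumber(html) {
  // Regex literal: \d and \s keep their meaning without extra escaping.
  const re = /(?:<a class="page-numbers[^>]*>(\d+)<\/a>\s*)+/;
  const m = html.match(re);
  // A repeated capture group keeps only its final iteration,
  // which here is the last numeric link in the run.
  return m ? Number(m[1]) : null;
}

// Same pattern through the constructor: each backslash is doubled
// so that it survives string parsing.
const viaConstructor = new RegExp('(?:<a class="page-numbers[^>]*>(\\d+)</a>\\s*)+');

// What happens without doubling: the backslashes are dropped by the string.
const unescapedSource = new RegExp("\d+\s*").source; // "d+s*"

// Sample markup copied from the question.
const sample = [
  '<a class="page-numbers" href="https://example.com/page/6/">6</a>',
  '<a class="page-numbers" href="https://example.com/page/7/">7</a>',
  '<a class="page-numbers" href="https://example.com/page/8/">8</a>',
  '<a class="next page-numbers" href="https://example.com/page/49/">NEXT</a>'
].join('\n');

console.log('Total pages: ' + lastPageNumber(sample)); // Total pages: 8
```

In Apps Script the helper could presumably be fed `UrlFetchApp.fetch(url).getContentText()` and the result written with `setValue('Total pages: ' + n)`. The pattern only looks at the first run of consecutive numeric links, so it would likely need adjusting if the real page separates them with an ellipsis or a current-page element.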
