I am trying to web scrape a dynamic website with puppeteer, using this code:

const puppeteer = require('puppeteer');

async function getTokoPedia(){
    const browser = await puppeteer.launch({ headless: false }); // disable headless mode for testing
    const page = await browser.newPage();
    await page.setViewport({ width: 1000, height: 926 });
    await page.goto("https://store.401games.ca/collections/pokemon-singles",{waitUntil: 'networkidle2'});

    console.log("start evaluate javascript")

    var productNames = await page.evaluate(()=>{
        var div = document.querySelectorAll('.info-container');
        console.log(div) // console.log inside evaluate shows in the browser console, not the Node console
        
        var productnames = [] 
        div.forEach(element => { 
            var price = element.querySelector(' .fs-result-page-3sdl0h')
            if(price != null){
                productnames.push(price.innerText);
            }
        });

        return productnames
    })

    console.log(productNames)
    browser.close()
} 

getTokoPedia();

However, upon running it, I get back an empty array. How can I fix this?

  • You need to give the website some time to load the contents. You might be able to use this: https://puppeteer.github.io/puppeteer/docs/puppeteer.page.waitforselector/ –  Jun 01 '22 at 22:04

1 Answer

Two problems:

  1. The elements you want are in a shadow root, so you have to pierce the root as described in Puppeteer not giving accurate HTML code for page with shadow roots.
  2. The cards lazy load, so you'd have to scroll down to populate their data into the DOM (a rough sketch of that follows this list).
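
For completeness, here's a rough, untested sketch of the scrolling approach, using the >>> shadow-piercing combinator explained further down. The .product-card selector, the assumption that scrolling the window triggers the lazy load, and the one-second pause are all guesses about how the page behaves:

const puppeteer = require("puppeteer"); // ^19.11.1

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://store.401games.ca/collections/pokemon-singles";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector(">>> .product-card");

  // keep scrolling until the card count stops growing (or a safety cap is hit)
  let count = 0;
  for (let i = 0; i < 50; i++) {
    const cards = await page.$$(">>> .product-card");
    if (i > 0 && cards.length === count) break;
    count = cards.length;
    await page.evaluate(() => window.scrollBy(0, window.innerHeight));
    await new Promise(resolve => setTimeout(resolve, 1000)); // crude wait for new cards to render
  }

  const items = await page.$$eval(">>> .product-card", els =>
    els.map(e => ({
      title: e.querySelector(".title")?.textContent,
      price: e.querySelector(".price")?.textContent,
    }))
  );
  console.log(items.length);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());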

But there's an easier way to get the initial set of data, which is in the static HTML as a JSON blob in var meta = {"products":...};. You can scrape it with a regex, as described in this tutorial.

Here's an example showing both approaches: piercing the shadow roots (manually, then with >>>) and the var meta regex:

const puppeteer = require("puppeteer"); // ^19.11.1

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const url = "https://store.401games.ca/collections/pokemon-singles";
  await page.goto(url, {waitUntil: "domcontentloaded"});

  {
  // here's the hard way for illustration:
  const el = await page.waitForSelector("#fast-simon-serp-app");
  await page.waitForFunction(({shadowRoot}) =>
    shadowRoot.querySelector(".product-card .title")
  , {}, el);
  const items = await el.evaluate(({shadowRoot}) =>
    [...shadowRoot.querySelectorAll(".product-card")]
      .map(e => ({
        title: e.querySelector(".title")?.textContent,
        price: e.querySelector(".price")?.textContent,
      }))
  );
  console.log(items); // just the first 6 or so
  }

  {
  // a little bit easier, using >>>:
  await page.waitForSelector(">>> .product-card .title");
  const items = await page.$$eval(">>> .product-card", els =>
    els.map(e => ({
      title: e.querySelector(".title")?.textContent,
      price: e.querySelector(".price")?.textContent,
    }))
  );
  console.log(items); // still just the first 6 or so
  }
  // TODO scroll the page to get the rest;
  // I didn't bother implementing that...

  // ...or do it the easy way:
  const html = await page.content();
  const pat = /^[\t ]*var meta = ({"products":[^\n]+);$/m;
  const data = JSON.parse(html.match(pat)[1]);
  console.log(JSON.stringify(data, null, 2));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

At this point, since we're not dealing with anything but the static HTML, you can dump Puppeteer and use axios or fetch to get the data more efficiently:

const axios = require("axios");

axios.get("https://store.401games.ca/collections/pokemon-singles")
  .then(({data: body}) => {
    const pat = /^[\t ]*var meta = ({"products":[^\n]+);$/m;
    const data = JSON.parse(body.match(pat)[1]);
    console.log(JSON.stringify(data, null, 2));
  })
  .catch(err => console.error(err));

Now, the data.products array contains 50 items, but the UI shows 26466 results. If you want more than the initial items from the static HTML's var meta, which appears to be the same on all 1000+ pages, I suggest using the API. A URL looks like https://ultimate-dot-acp-magento.appspot.com/categories_navigation?request_source=v-next&src=v-next&UUID=d3cae9c0-9d9b-4fe3-ad81-873270df14b5&uuid=d3cae9c0-9d9b-4fe3-ad81-873270df14b5&store_id=17041809&cdn_cache_key=1654217982&api_type=json&category_id=269055623355&facets_required=1&products_per_page=5000&page_num=1&with_product_attributes=true.

You can see there are IDs and keys that probably protect against usage by parties other than the site, but I didn't see anything change other than cdn_cache_key after a few tries. I'm not sure how long a URL stays valid, but while it does, you can set products_per_page=1000, for example, then step page_num forward 27 times or so. This gets you all of the data while avoiding the difficulties of scraping the page itself.
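
One way to set products_per_page and page_num on that URL is with Node's built-in URL API rather than string manipulation. Here's a tiny sketch; the parameter names come from the URL above, and pageUrl is just a made-up helper name:

// set products_per_page and page_num via URLSearchParams instead of a regex
const pageUrl = (apiUrl, pageNum, perPage = 1000) => {
  const u = new URL(apiUrl);
  u.searchParams.set("products_per_page", perPage);
  u.searchParams.set("page_num", pageNum);
  return u.toString();
};

// usage: pageUrl(categoriesNavigationUrl, 1), ..., pageUrl(categoriesNavigationUrl, 27)

The scripts below use a regex replace on the captured URL instead, which works just as well for this format.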

Here's a pessimistic approach that uses Puppeteer to get an up-to-date URL, in case a URL goes stale:

const axios = require("axios");
const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://store.401games.ca/collections/pokemon-singles";
  const reqP = page.waitForRequest(res =>
    res.url()
      .startsWith("https://api.fastsimon.com/categories_navigation")
  );
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const req = await reqP;
  const apiUrl = req
    .url()
    .replace(/(?<=products_per_page=)(\d+)/, 1000);
  const {data} = await axios.get(apiUrl);
  console.log(JSON.stringify(data, null, 2));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

And tossing in the loop:

const axios = require("axios");
const fs = require("fs/promises");
const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://store.401games.ca/collections/pokemon-singles";
  const reqP = page.waitForRequest(res =>
    res.url()
      .startsWith("https://api.fastsimon.com/categories_navigation")
  );
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const req = await reqP;
  const apiUrl = req
    .url()
    .replace(/(?<=products_per_page=)(\d+)/, 1000);
  const items = [];

  for (let i = 1;; i++) {
    const pageUrl = apiUrl.replace(/(?<=page_num=)(\d+)/, i);
    const response = await axios.get(pageUrl);

    if (response.status !== 200 ||
        items.length >= response.data.total_results) {
      break;
    }

    items.push(...response.data.items);
  }

  await fs.writeFile("data.json", JSON.stringify(items));
  console.log(items.slice(0, 10));
  console.log(items.length);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

This hammers the site, pulling a ton of data in a short amount of time, so consider this script for educational purposes, or modify it to throttle your requests way back.
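
As a rough sketch of that throttling, you could drop a fixed pause into the loop above after each request (the 2-second delay is an arbitrary choice, not something the site documents):

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// same loop as above, with a pause after each page of results
for (let i = 1;; i++) {
  const pageUrl = apiUrl.replace(/(?<=page_num=)(\d+)/, i);
  const response = await axios.get(pageUrl);

  if (response.status !== 200 ||
      items.length >= response.data.total_results) {
    break;
  }

  items.push(...response.data.items);
  await sleep(2000); // arbitrary 2s between requests to go easy on the site
}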

  • Thank you! I used the first method and it worked very well. Even after reading the tutorial for the second method, I still didn't quite understand it, but I will try again in the foreseeable future. – Medusa Jun 02 '22 at 14:02
  • Do you know how I would do pagination? I tried following this post: https://stackoverflow.com/questions/52325114/overcoming-pagination-when-using-puppeteer-library-for-web-scraping but it doesn't quite work. – Medusa Jun 02 '22 at 17:23
  • I figured that'd be the follow-up question. I checked the network requests, and there's an API that appears to have the bulk of the data, assuming you can get the key or intercept it. A sample link is https://ultimate-dot-acp-magento.appspot.com/categories_navigation?request_source=v-next&src=v-next&UUID=d3cae9c0-9d9b-4fe3-ad81-873270df14b5&uuid=d3cae9c0-9d9b-4fe3-ad81-873270df14b5&store_id=17041809&cdn_cache_key=1654217982&api_type=json&category_id=269055623355&facets_required=1&products_per_page=2500&page_num=1&with_product_attributes=true – ggorlen Jun 03 '22 at 01:10
  • The above link will probably expire by the time you click it, but you can get it dynamically with Puppeteer, then send your own request with `products_per_page=2500` (or even higher) and grab all the data that way with a few easy HTTP requests. Using Puppeteer to pull 1059 pages at 30 results each and dealing with the slowness of scrolling and navigation is painful to think about. The coding is annoying and unreliable, and the script would take a long time to run, if you're lucky and it doesn't crash or start missing data halfway through due to some unforeseen edge case. – ggorlen Jun 03 '22 at 01:12
  • How would I find out if a website has a shadow root? I am trying to do the same thing for https://hairyt.com/pages/pokemon-advanced-search?q=&game=pokemon&availabilty=true&setNames=&rarities=&types=&pricemin=5.00&pricemax=25.00&page=2&order=price-descending, but I can't seem to figure out how to apply your first method to it. Thank you. – Medusa Jun 03 '22 at 16:54
  • I usually just look at the elements in the inspector, but you could modify the code in [this post](https://stackoverflow.com/questions/68525115/puppeteer-not-giving-accurate-html-code-for-page-with-shadow-roots/68540701#68540701) to recursively hunt for `shadowRoot` properties if you want to do it programmatically (maybe there's a better way; a rough sketch follows these comments). The new page doesn't seem to have a shadow root, though, so you should be able to select as normal. That said, I suggest using the API or requests first. It's so much easier than scraping the DOM, which is basically a last resort. – ggorlen Jun 03 '22 at 17:52
  • In this case, looks like they're POSTing to https://portal.binderpos.com/external/shopify/products/forStore for the data (on a quick inspection). Better to ask a new question if you're stuck with something else. – ggorlen Jun 03 '22 at 17:55
  • Thank you once again. I took your advice but it seems that the API link is being protected by some sort of authorization. – Medusa Jun 03 '22 at 19:14
  • Sure, but did you try intercepting the requests or making your own using the same strategy I showed above on the original site here? – ggorlen Jun 03 '22 at 23:03
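
As for finding shadow roots programmatically, here's a rough, untested sketch of the recursive shadowRoot hunt mentioned in the comments above. It can only see open shadow roots (closed ones report shadowRoot as null), and the fixed 5-second wait is a crude stand-in for waiting until the page has rendered:

const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto("https://store.401games.ca/collections/pokemon-singles", {
    waitUntil: "domcontentloaded",
  });
  await new Promise(resolve => setTimeout(resolve, 5000)); // crude wait; adjust as needed

  // walk the DOM and any open shadow roots, reporting each shadow host found
  const hosts = await page.evaluate(() => {
    const found = [];
    const walk = root => {
      for (const el of root.querySelectorAll("*")) {
        if (el.shadowRoot) {
          found.push(el.tagName.toLowerCase() + (el.id ? "#" + el.id : ""));
          walk(el.shadowRoot); // recurse into nested shadow roots
        }
      }
    };
    walk(document);
    return found;
  });
  console.log(hosts); // should include the #fast-simon-serp-app host used above
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());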