I am trying to load a product page with Puppeteer, but it's not working.

    const puppeteer = require('puppeteer')

    async function start(){
        const browser = await puppeteer.launch()
        const page = await browser.newPage()

        // A timeout of 0 disables the navigation timeout entirely
        page.setDefaultNavigationTimeout(0);

        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36');

        const url = "https://www.coupang.com/vp/products/2275049712?itemId=3903560010"
        // Navigation resolves only after every listed event has fired
        await page.goto(url, {'waitUntil' : ['load', 'domcontentloaded', 'networkidle0', 'networkidle2']})
        await page.screenshot({path: "screenshot3.png", fullPage:true})
        await browser.close();
    }

    start()

If we open this URL in a browser, only half of the page loads; the rest loads when we scroll down.

I tried scrolling as well, but it did not work.

The scroll function is the following:

    const waitTillHTMLRendered = async (page, timeout = 30000) => {
      const checkDurationMsecs = 1000;
      const maxChecks = timeout / checkDurationMsecs;
      let lastHTMLSize = 0;
      let checkCounts = 1;
      let countStableSizeIterations = 0;
      const minStableSizeIterations = 3;

      while(checkCounts++ <= maxChecks){
        let html = await page.content();
        let currentHTMLSize = html.length;

        let bodyHTMLSize = await page.evaluate(() => document.body.innerHTML.length);

        console.log('last: ', lastHTMLSize, ' <> curr: ', currentHTMLSize, " body html size: ", bodyHTMLSize);

        if(lastHTMLSize != 0 && currentHTMLSize == lastHTMLSize)
          countStableSizeIterations++;
        else
          countStableSizeIterations = 0; //reset the counter

        if(countStableSizeIterations >= minStableSizeIterations) {
          console.log("Page rendered fully..");
          break;
        }

        lastHTMLSize = currentHTMLSize;
        await page.waitForTimeout(checkDurationMsecs);
      }
    };
ggorlen
ScrapperMaster

1 Answer

When I run this headfully, the page doesn't load fully with the review content. The site seems to be detecting the bot and blocking those reviews from coming through regardless of the scroll.

Using puppeteer-extra-plugin-stealth headfully avoids detection, but headless mode with stealth is still blocked. I'll update if I can find a solution, but I figure this is at least a step forward.

    const puppeteer = require("puppeteer-extra"); // ^3.2.3
    const StealthPlugin = require("puppeteer-extra-plugin-stealth"); // ^2.9.0
    puppeteer.use(StealthPlugin());

    let browser;
    (async () => {
      browser = await puppeteer.launch({headless: false});
      const [page] = await browser.pages();
      const url = "https://www.coupang.com/vp/products/2275049712?itemId=3903560010";
      await page.goto(url, {waitUntil: "domcontentloaded"});
      await page.waitForSelector(".sdp-review__article__list__review__content");
      await page.waitForNetworkIdle();
      await page.screenshot({path: "screenshot3.png", fullPage: true});
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
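
Since the question notes that the rest of the page only loads on scroll, a scroll step may still be needed once the page is no longer being blocked. Here is a minimal sketch, assuming a hypothetical helper named `autoScroll` (not part of Puppeteer; the option names `stepDelayMs` and `maxSteps` are also made up for illustration). It scrolls one viewport at a time and stops when a scroll no longer moves the page or a step cap is reached, relying only on `page.evaluate`:

```javascript
// Hypothetical helper, not part of Puppeteer's API.
// Scrolls down one viewport height at a time, pausing between scrolls so
// lazy-loaded content can arrive. Stops when a scroll no longer moves the
// page (bottom reached) or after maxSteps attempts.
// Returns the number of scrolls that actually moved the page.
async function autoScroll(page, {stepDelayMs = 250, maxSteps = 100} = {}) {
  let scrolls = 0;
  for (let i = 0; i < maxSteps; i++) {
    // Runs in the browser: try to scroll and report whether the position changed.
    const moved = await page.evaluate(() => {
      const before = window.scrollY;
      window.scrollBy(0, window.innerHeight);
      return window.scrollY > before;
    });
    if (!moved) break;
    scrolls++;
    // Wait before the next attempt so newly requested content can extend the page.
    await new Promise(resolve => setTimeout(resolve, stepDelayMs));
  }
  return scrolls;
}
```

In the script above, `await autoScroll(page);` could go after `page.goto` and before `page.waitForSelector`. The delay and step cap are guesses and would likely need tuning for a given site.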

In the future, if you see waitForSelector timeouts when running headlessly, it's a good idea to add console.log(await page.content()); before the failing call. The output will usually show that you've been blocked, before you waste time messing with scrolling and other futile strategies.
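
One way to build that check in is to wrap the wait so the HTML is dumped only when the selector fails to appear. This is a rough sketch, assuming a hypothetical wrapper named `waitForSelectorOrDump` and an arbitrary 2000-character cap on the output; it uses only `page.waitForSelector` and `page.content`:

```javascript
// Hypothetical wrapper, not part of Puppeteer's API.
// Waits for a selector; if the wait fails (for example on a timeout),
// logs the start of the page's HTML so a block page is easy to spot,
// then rethrows the original error.
async function waitForSelectorOrDump(page, selector, options = {}) {
  try {
    return await page.waitForSelector(selector, options);
  } catch (err) {
    const html = await page.content();
    console.error(`waitForSelector("${selector}") failed: ${err.message}`);
    // 2000 characters is an arbitrary cap to keep the log readable.
    console.error(html.slice(0, 2000));
    throw err;
  }
}
```

Calling `await waitForSelectorOrDump(page, ".sdp-review__article__list__review__content")` in place of the plain `waitForSelector` call would then print the served HTML whenever the reviews never show up.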

See also "Why does headless need to be false for Puppeteer to work?"

ggorlen
  • Thanks for the response. It does work. I added a scroll to load the reviews and make it completely autonomous. Can you tell me the general reasons why it does not work headlessly? – ScrapperMaster Sep 11 '22 at 17:55
  • That's a good question. I don't know a huge amount about the mechanics of detection (my speciality is working within pages after they've loaded successfully), but for whatever reason, many bot detectors have a much easier time figuring out that you're a bot if you run headlessly than headfully. It's probably a good research project, so feel free to drop a link or add a [self-answer](https://stackoverflow.com/help/self-answer) if you learn more or figure out how to bypass it headlessly. – ggorlen Sep 11 '22 at 18:09