
I am trying to scrape startup data from a site with Puppeteer. When I navigate to the next page, the Cloudflare waiting screen appears and disrupts the scraper. I tried changing the IP, but the result is the same. Is there a way to bypass it with Puppeteer?

const puppeteer = require("puppeteer");

(async () => {

  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  const page = await browser.newPage();

  page.setDefaultNavigationTimeout(0);

  let links = [];

  // initial page

  await page.goto(`https://www.startupranking.com/top/india`, {
    waitUntil: "networkidle0",
  });

  // looping through the url to different pages

  for (let i = 2; i <= 7; i++) {
    if (i === 3) {
      console.log("waiting");

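      // page.waitFor was removed from newer Puppeteer releases; a plain timer works in any version
      await new Promise((resolve) => setTimeout(resolve, 20000));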

      console.log("waited");
    }

    const onPageLinks = await page.$$eval("tr .name a", (arr) =>
      arr.map((cur) => cur.href)
    );

    links = links.concat(onPageLinks);

    console.log(onPageLinks, "inside loop");

    await page.goto(`https://www.startupranking.com/top/india/${i}`, {
      waitUntil: "networkidle0",
    });
  }

  console.log(links, links.length, "outside loop");
})();

Since the check only happens on the first loop iteration, I added a 20-second wait to cover the time the check takes. This works on some IPs, but on others Cloudflare presents challenges to solve. I have to run this on a server, so I am looking for a way to bypass the check completely.
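
One possible alternative to the hard-coded 20-second pause, sketched here under assumptions: rather than sleeping, wait until the listing rows matched by the `tr .name a` selector from the code above show up, which should only happen once the Cloudflare waiting screen has handed over to the real page. The helper name `gotoAndWaitForRows` and the 60-second default timeout are hypothetical. This only waits out the automatic check; it does nothing about an interactive challenge.

```javascript
// Hypothetical helper: navigate, then wait for the real content instead of
// sleeping for a fixed time. `page` is a Puppeteer Page; `selector` should
// match something that exists on the target page but not on the
// Cloudflare waiting screen.
async function gotoAndWaitForRows(page, url, selector = "tr .name a", timeoutMs = 60000) {
  await page.goto(url, { waitUntil: "domcontentloaded" });

  try {
    // Resolves as soon as the selector appears; rejects when the timeout expires.
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch (err) {
    // The rows never appeared: most likely still on the waiting screen
    // or stuck on an interactive challenge. Signal that to the caller.
    return null;
  }

  // Collect the link targets, exactly as the original loop does.
  return page.$$eval(selector, (anchors) => anchors.map((a) => a.href));
}
```

Inside the loop this would replace the `page.goto` call and the fixed wait, for example `const onPageLinks = await gotoAndWaitForRows(page, url);`, with a `null` result meaning the page was still blocked when the timeout ran out.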

Yukulélé
atul-gairola
  • Have you already tried the answers on the existing `[puppeteer] [recaptcha]` and `[puppeteer] [captcha]` questions? Especially this one: https://stackoverflow.com/a/55500565/12412595 – theDavidBarton Jul 06 '20 at 12:17
  • They will throw up a captcha if the ip is suspicious. Probably any datacenter ip would get one. – pguardiario Jul 08 '20 at 02:16
  • @theDavidBarton The SO page you linked is 100% unrelated to the OP question. The challenge here is the CloudFlare browser validation system, not a captcha/recaptcha system. The CloudFlare protection page in front of many sites these days just runs a series of JS tests on the client to determine if it's a real browser. Since Puppeteer uses Chromium, it's a real browser, and should be able to get past, but it's not. – tpartee Nov 28 '20 at 21:14
  • I'm also looking for a solution to this issue, will let you know if I find anything. My Puppeteer using the latest Chromium plus the Extra-Stealth module is just spinning on the CloudFlare challenge re-loading it every few seconds and not getting past. Even in non-headless mode I'm seeing this. – tpartee Nov 28 '20 at 21:16
  • @tpartee I'm currently working with playwright (the microsoft project based on puppeteer) and it doesn't seem to get past the cloudflare protection. Did you find a solution? – trixn Mar 12 '21 at 12:07
  • @trixn The only workaround that I could make work involved using a non-headless browser over the same IP address and snapshotting the cookie info from the site (which included the CloudFlare cookies) and then using those in the cookie jar for my Puppeteer and Perl scripts. It's less than ideal because it's not fully-automated, but it did work for me. Those cookies appear to be good for at least 3 months, so every 3 months I have to just manually get/set them again. – tpartee Mar 13 '21 at 17:20
  • @tpartee Thanks a lot for your response. As cloudflare is probably throwing a lot of money on making those i-am-a-human checks reliable this seems to be a reasonable way to bypass it. – trixn Mar 15 '21 at 10:15
  • @trixn Sure thing, this is an ever-evolving arms race. Hackers always seem to be able to stay a step ahead, but every now and then something comes along that can't be beaten, like the v3 Captcha that Google devised. They have a really unfair advantage where they can use their ocean of data to determine "user IPs" from "server IPs" and then further look at activity and determine whether to block or not on a request-by-request basis. Someone will create a network to defeat it, but the complexity of solutions is growing every month. – tpartee Mar 16 '21 at 16:13
  • @tpartee Yeah, after all this is what they are selling. So maybe I shouldn't be trying to beat them reliably with my limited amount of budget, time and effort. My initial expectation was that as it's a real browser controlled by playwright it might just work, but apparently that was a bit optimistic. What I thought of to at least partially automate the process is to somehow show those cloudflare protected sites to a real user of our platform in certain intervals and let them obtain the cookies needed. But I guess this would have to be a browser extension as those cookies are probably httpOnly. – trixn Mar 16 '21 at 16:21
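
The workaround tpartee describes above (pass the Cloudflare check once in a regular browser on the same IP, then reuse the resulting cookies) might look roughly like the sketch below. The assumptions here are not from the thread: the cookies have been exported by hand to a JSON file, here the hypothetical `cookies.json`, as an array of objects with `name`, `value`, `domain` and an optional `expires` in Unix seconds, and `usableCookies` and `openWithCookies` are made-up names. Cloudflare may also tie its cookies to the user agent, so the script would probably need to send the same user agent string as the browser the cookies came from.

```javascript
const fs = require("fs");

// Keep only cookies that belong to the target host and have not expired,
// and reduce them to fields that page.setCookie understands.
// `expires` is assumed to be in Unix seconds; a missing value or -1 is
// treated as a session cookie and kept.
function usableCookies(cookies, hostname, nowSeconds = Date.now() / 1000) {
  return cookies
    .filter((cookie) => {
      const domain = String(cookie.domain || "").replace(/^\./, "");
      const matchesHost = hostname === domain || hostname.endsWith("." + domain);
      const isSession = cookie.expires === undefined || cookie.expires === -1;
      return domain !== "" && matchesHost && (isSession || cookie.expires > nowSeconds);
    })
    .map(({ name, value, domain, path = "/", expires, httpOnly = false, secure = false }) => {
      const clean = { name, value, domain, path, httpOnly, secure };
      if (typeof expires === "number" && expires > 0) clean.expires = expires;
      return clean;
    });
}

// Open `url` in Puppeteer with the saved cookies already in the cookie jar.
async function openWithCookies(url, cookieFile, userAgent) {
  const puppeteer = require("puppeteer"); // loaded lazily; must be installed
  const saved = JSON.parse(fs.readFileSync(cookieFile, "utf8"));
  const cookies = usableCookies(saved, new URL(url).hostname);

  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  if (userAgent) await page.setUserAgent(userAgent);
  await page.setCookie(...cookies); // each cookie carries its own domain
  await page.goto(url, { waitUntil: "networkidle0" });
  return { browser, page };
}
```

A call such as `openWithCookies("https://www.startupranking.com/top/india", "cookies.json")` would return the browser and page for further scraping. As the thread notes, the cookies still have to be refreshed by hand once they expire, so this is only partly automated.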
