I was using curl to scrape HTML from a certain website. Then they changed their server settings, and curl can no longer get the page content; it fails with error code 1020. So I changed my script to use elinks.

But now they are using Cloudflare, and elinks no longer works either (only on this particular website); it gives the same error code 1020.

Is there a command or command-line option that lets me use another browser (Firefox, Chromium, Google Chrome, ...) to get the page HTML in a terminal?

Neo Mosaid

2 Answers

1

If you can write scripts for Node.js, here is a small example using the puppeteer library. It logs the page source after the page has loaded in Chrome, including dynamic content generated by page scripts. The browser window is visible here (`headless: false`); set `headless: true` to run it invisibly:

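import puppeteer from 'puppeteer';

// headless: false opens a visible browser window; defaultViewport: null
// lets the page use the window's size instead of puppeteer's default viewport.
const browser = await puppeteer.launch({ headless: false, defaultViewport: null });

try {
  // Reuse the tab that the browser opens on launch.
  const [page] = await browser.pages();
  await page.goto('https://example.org/');
  // Print the HTML of the page as rendered, after its scripts have run.
  console.log(await page.content());
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}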
vsemozhebuty
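Since the question asks for something usable from a terminal, the same idea can be wrapped so that the URL comes from the command line and the HTML goes to stdout. This is only a sketch, assuming `npm i puppeteer` has been run; `pickUrl` and `dumpHtml` are hypothetical helper names, not part of puppeteer, and `waitUntil: 'networkidle2'` is one possible choice for deciding when the page has finished loading.

```javascript
// Sketch: print a page's rendered HTML to stdout.
// pickUrl and dumpHtml are hypothetical names, not puppeteer APIs.

// Validate the command-line argument and return a normalized http(s) URL.
function pickUrl(arg) {
  const url = new URL(arg); // throws TypeError on malformed input
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error(`unsupported protocol: ${url.protocol}`);
  }
  return url.href;
}

async function dumpHtml(arg, { headless = true } = {}) {
  const url = pickUrl(arg);
  // A dynamic import works from both CommonJS (.js) and ES module (.mjs) files.
  const { default: puppeteer } = await import('puppeteer');
  const browser = await puppeteer.launch({ headless });
  try {
    const page = await browser.newPage();
    // Assumption: waiting for the network to go quiet is "loaded enough".
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content();
  } finally {
    await browser.close(); // always release the browser process
  }
}

// Run only when an http(s) URL was passed as the first argument.
if (/^https?:\/\//.test(process.argv[2] ?? '')) {
  dumpHtml(process.argv[2])
    .then((html) => process.stdout.write(html + '\n'))
    .catch((err) => {
      console.error(err.message);
      process.exitCode = 1;
    });
}
```

Saved under a name of your choice, for example `dump-html.js`, it could be run as `node dump-html.js https://example.org/ > page.html`. Whether the site still answers with error 1020 depends on its Cloudflare rules, not on this wrapper.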
  • Unfortunately it gives the same error code 1020 as curl and elinks. Either this is a misconfiguration on their part or they are really paranoid. Other websites, even ones behind Cloudflare, work just fine. – Neo Mosaid Jul 30 '21 at 19:31
  • @NeoMosaid Sometimes launching the browser in headful mode can help. I've edited the second line of code for this. Also you can try [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth). – vsemozhebuty Jul 30 '21 at 19:38
  • `SyntaxError: Cannot use import statement outside a module`. Sorry, I've never used Node.js before. – Neo Mosaid Jul 30 '21 at 21:28
  • @NeoMosaid Try changing the script's file extension to `.mjs`; Node.js will then treat it as an ES module, which allows `import` and top-level `await`. – vsemozhebuty Jul 30 '21 at 21:35
0

Here are the libraries and code that can bypass Cloudflare protection:

Libraries:

npm i puppeteer-extra puppeteer-extra-plugin-stealth puppeteer

Node.js:

const puppeteer = require('puppeteer-extra')
const pluginStealth = require('puppeteer-extra-plugin-stealth')
const { executablePath } = require('puppeteer')

const link = 'https://www.g2.com/'

// Register the stealth plugin once, before any browser is launched.
puppeteer.use(pluginStealth())

const getHtmlThroughCloudflare = async (url) => {
  // Point puppeteer-extra at the browser binary that puppeteer downloaded.
  const browser = await puppeteer.launch({
    headless: true,
    executablePath: executablePath(),
  })
  try {
    const page = await browser.newPage()
    await page.goto(url)
    return await page.content()
  } finally {
    // Close the browser even if navigation fails.
    await browser.close()
  }
}

getHtmlThroughCloudflare(link)
  .then((html) => console.log(`HTML: ${html}`))
  .catch((err) => console.error(err))
Roma N
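Neither answer guarantees that the block is lifted (the asker reports error 1020 persisting with plain puppeteer), so it can be worth checking the result before using it. Below is a minimal sketch, assuming the blocked response arrives as HTTP 403 with the text "error code 1020" (or similar) somewhere in the body; that wording is an assumption based on the error reported in the question, not something the answers confirm. `looksLikeError1020` and `contentOrThrow` are hypothetical helper names, and `page` is an already opened puppeteer page.

```javascript
// Heuristic check for a Cloudflare-style "error 1020" response.
// Assumption: blocked requests come back as HTTP 403 with the error code
// in the body; adjust the pattern to what the site actually returns.
function looksLikeError1020(status, body) {
  if (status !== 403) return false;
  // Replace tags with spaces so markup cannot split the phrase.
  const text = body.replace(/<[^>]*>/g, ' ');
  return /error(\s+code)?\s*:?\s*1020\b/i.test(text);
}

// Navigate an existing puppeteer page and return its HTML,
// throwing instead of silently returning the block page.
async function contentOrThrow(page, url) {
  const response = await page.goto(url);
  const html = await page.content();
  // page.goto can resolve with null, so guard before reading the status.
  const status = response ? response.status() : 0;
  if (looksLikeError1020(status, html)) {
    throw new Error(`blocked with error 1020: ${url}`);
  }
  return html;
}
```

With either answer's script, `await contentOrThrow(page, url)` would take the place of the separate `page.goto(url)` and `page.content()` calls, so a blocked request ends in an error rather than in a saved block page.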