
I would like to collect some info from a page. First I checked with the Chrome inspector and console how to find the right value, and everything was OK. Then I pasted the code into my Puppeteer/cheerio environment, and for some reason I cannot collect the right data.

This is the part that works in Chrome:

const modellek = $('[columntype="model"] > section > ul > li').map(function() {
    return $(this).text();
});

console.log(modellek)
["txt1","txt2","txt3","txt4"...]

The JS script is the following:

const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

async function scrapHome(url){
    try{
        const browser = await puppeteer.launch({headless: false});
        const page = await browser.newPage();
    
        await page.setViewport({width: 1366, height: 768});
        await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108'); 
        

        const html = await page.evaluate(() => document.body.innerHTML);
        const $ = await cheerio.load(html);
        await page.goto(url);

        const models= $('[columntype="model"] > section > ul > li').map(function() {
                      return ($(this).text().get())});

        console.log(models)

    } catch (err) {
        console.error(err);
    };
};

scrapHome("https://example.com/");

But the result is an empty array: [].

I also tried waitForSelector, but in that case there is no response at all.

page
    .waitForSelector('[columntype="model"]')
    .then(() => $('[columntype="model"] > section > ul > li').map(function() {
        console.log($(this).text());
    }));

Any idea how to get the requested info?

Denes
  • Using cheerio with Puppeteer is somewhat weird. Either the content you want is dynamic or it is static. If it's static, you can use cheerio. If it's dynamic, use Puppeteer. Basically, importing Puppeteer then using cheerio to do the scraping is like buying a bike, then carrying it around instead of riding it. If the content is dynamic, skip cheerio and use Puppeteer selectors like `page.$("your selector")`. Is example.com really the page you're scraping? If not, please share the actual page and show the data you want. – ggorlen Jul 02 '21 at 17:06
  • See also: [How can I scrape pages with dynamic content using node.js?](https://stackoverflow.com/questions/28739098/how-can-i-scrape-pages-with-dynamic-content-using-node-js) – ggorlen Jul 02 '21 at 17:34
  • 1
    @ggorlen That is an awesome metaphor with the bike! – Vaviloff Jul 02 '21 at 18:24
  • Thanks, I see. I need to move on with puppeteer. – Denes Jul 02 '21 at 19:33

2 Answers


First, you need to actually go to the page:

await page.goto(url);

And only then get the HTML of that page:

const html = await page.evaluate(() => document.body.innerHTML);

Also, depending on the site you're working with, it is possible those models will not be available right away when you load the web page (for example, if they are generated by a JS script or loaded via AJAX).

In this case you should wait for the desired element to appear on the page:

await page.waitForSelector('[columntype="model"] > section > ul > li');
const html = await page.evaluate(() => document.body.innerHTML);
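Putting those steps in order, the whole flow might look like the sketch below (the selector and function name are taken from the question; I've also moved `.get()` after the `.map()` call, which is what converts cheerio's result into a plain array, and used `page.content()` to get the full HTML. The `require()` calls sit inside the function purely so this sketch parses even before `npm i puppeteer cheerio` — in real code they go at the top):

```javascript
async function scrapHome(url) {
  // Kept inside the function only so the sketch loads without the
  // packages installed — normally these requires go at the top of the file.
  const puppeteer = require("puppeteer");
  const cheerio = require("cheerio");

  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto(url);                                                   // 1. navigate first
    await page.waitForSelector('[columntype="model"] > section > ul > li'); // 2. wait for the dynamic list
    const html = await page.content();                                      // 3. only now grab the HTML
    const $ = cheerio.load(html);                                           // cheerio.load is synchronous

    // .get() belongs after .map(): it turns the cheerio wrapper into a plain array
    const models = $('[columntype="model"] > section > ul > li')
      .map((i, el) => $(el).text())
      .get();
    console.log(models);
    return models;
  } finally {
    await browser.close();
  }
}
```

Note that cheerio's `.map()` callback receives `(index, element)`, unlike the browser's `Array.prototype.map`.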
Vaviloff
  • Thanks. Meanwhile I read about cheerio, and because the content is dynamic I cannot reach it with cheerio; I need to use only Puppeteer. But that raises a new question: how do I get all the li elements? Your example only prints the first one. – Denes Jul 02 '21 at 12:52

In the Chrome console you would do:

$$('[columntype="model"] > section > ul > li').map(li => li.innerText)

In Puppeteer you would do:

page.$$eval('[columntype="model"] > section > ul > li', lis => lis.map(li => li.innerText))
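The callback you pass to `page.$$eval` runs inside the page and receives the matched elements as a plain array, so the mapping itself is ordinary JavaScript. A stand-alone sketch of just that mapping step, using stub objects in place of real `<li>` elements (no browser involved):

```javascript
// Stub elements standing in for the matched <li> nodes.
const lis = [
  { innerText: "txt1" },
  { innerText: "txt2" },
  { innerText: "txt3" },
];

// The same mapping the $$eval callback performs in the page.
const texts = lis.map(li => li.innerText);
console.log(texts); // [ 'txt1', 'txt2', 'txt3' ]
```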
pguardiario