
I've written a script in Node.js to scrape the links of different titles from a webpage. When I execute the following script, I get undefined printed in the console instead of the links I'm after. My selectors are accurate.

I do not wish to put the links in an array and return the results; rather, I wish to print them on the fly. As I'm very new to writing scripts using Node.js in combination with Puppeteer, I can't figure out the mistake I'm making.

This is my script:

const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");
            let url = await page.evaluate(() => {
                let items = document.querySelectorAll('a.question-hyperlink');
                items.forEach((item) => {
                    //would like to keep the following line intact 
                    console.log(item.getAttribute('href'));
                });
            })
            browser.close();
            return resolve(url);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);

The following script works just fine if I declare an empty array results, store the scraped links in it, and finally return it, but I do not wish to do it this way. I would like to stick to the approach I tried above, printing each result on the fly.

const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");
            let urls = await page.evaluate(() => {
                let results = [];
                let items = document.querySelectorAll('a.question-hyperlink');
                items.forEach((item) => {
                    results.push({
                        url:  item.getAttribute('href'),
                    });
                });
                return results;
            })
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);

Once again: how can I print each link on the fly with console.log(item.getAttribute('href')); without storing it in an array?

SIM
  • Which console.log call prints `undefined`? For example, the `url` variable will always be undefined because you are not returning anything. – Cristiano Oct 08 '18 at 21:55
  • Possible duplicate of [Communicating between the main and renderer function in Puppeteer](https://stackoverflow.com/questions/52684640/communicating-between-the-main-and-renderer-function-in-puppeteer) – Md. Abu Taher Oct 10 '18 at 12:25
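
As the first comment points out, the evaluate callback in the question has no return statement, so the value that comes back is undefined. A minimal sketch of that behaviour in plain JavaScript, with made-up sample paths standing in for the scraped hrefs (the function names are illustrative only):

```javascript
// Made-up sample data standing in for the hrefs found on the page.
const hrefs = ['/questions/1', '/questions/2'];

// Mirrors the first script: the callback loops but never returns anything.
const withoutReturn = () => {
  hrefs.forEach((href) => href);
};

// Mirrors the second script: the callback builds and returns a value.
const withReturn = () => hrefs.map((href) => href);

console.log(withoutReturn()); // undefined
console.log(withReturn().length); // 2
```

The same rule applies to page.evaluate(): it resolves with whatever the callback returns, so a callback that only logs resolves with undefined.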

3 Answers


To see console.log() output from inside evaluate() in the Node console, add the line below right after you define page:

page.on('console', msg => console.log(msg.text()));

The whole snippet then looks like this:

const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            // Forward console messages from the page to the Node console.
            page.on('console', msg => console.log(msg.text()));
            await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");
            let url = await page.evaluate(() => {
                let items = document.querySelectorAll('a.question-hyperlink');
                items.forEach((item) => {
                    //would like to keep the following line intact 
                    console.log(item.getAttribute('href'));
                });
            })
            browser.close();
            return resolve(url);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);

Hope this helps.

Atishay Jain
  • Thanks for your solution @Atishay Jain. My second script is already working. I do not wish to follow that logic; rather, I want to print the result on the fly like I tried in my first script. – SIM Oct 11 '18 at 16:45
  • Sorry, I got the question wrong the first time. I've updated the answer; please check, @asmitu. – Atishay Jain Oct 12 '18 at 05:47
  • Now this is more like it. Thanks @Atishay Jain for your working solution. Let's wait a few more days while the bounty is on. – SIM Oct 12 '18 at 08:06

The library looks a bit awkward to use, but I found the proper way to get an href in this GitHub thread: https://github.com/GoogleChrome/puppeteer/issues/628

The working code I have uses await page.$$eval:

const puppeteer = require('puppeteer');

// An async function already returns a promise, so no new Promise wrapper is needed.
async function getStackoverflowLinks() {
  console.log('going to launch chromium via puppeteer');
  const browser = await puppeteer.launch();
  try {
    console.log('creating page/tab');
    const page = await browser.newPage();
    await page.goto('https://stackoverflow.com/questions/tagged/web-scraping');
    console.log('fetched SO web-scraping, now parsing link href');

    // $$eval passes the array of matched elements to the callback; map collects each href.
    const matches = await page.$$eval('a.question-hyperlink', anchors => anchors.map(a => a.href));
    console.log('matches = ', matches.length);
    return matches;
  } finally {
    // Close the browser even if something above throws.
    await browser.close();
  }
}

getStackoverflowLinks()
  .then(hrefs => {
    console.log('hrefs: ', hrefs);
  })
  .catch(console.error);
Jim Factor

Things to note:

  • async function will return a promise.
  • new Promise will also return a promise.
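
A small sketch of why wrapping an async function in new Promise is redundant. The function names and sample paths below are hypothetical stand-ins, not part of Puppeteer:

```javascript
// Hypothetical stand-in for the Puppeteer work; it resolves with sample links.
async function fetchLinks() {
  return ['/questions/1', '/questions/2'];
}

// Redundant: an async executor wrapped in another promise, as in the question.
function runWrapped() {
  return new Promise(async (resolve, reject) => {
    try {
      resolve(await fetchLinks());
    } catch (e) {
      reject(e);
    }
  });
}

// Equivalent and shorter: the async function's promise is returned directly,
// and any thrown error becomes a rejection automatically.
async function runDirect() {
  return fetchLinks();
}

runDirect().then((links) => console.log(links.length));
```

Both functions return a promise that resolves with the same value; the second simply has less code that can go wrong.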

On that note, you can simply use the page's console event to print them on the fly. Usage:

page.on("console", msg => console.log(msg.text()));
await page.evaluate(() => {
  console.log("I will be printed on node console too")
})

Advanced usage has been discussed on this answer.
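
Putting the notes above together, one possible shape for the question's script is sketched below. The name printLinks and the choice to pass the puppeteer module in as a parameter are assumptions made for illustration; the selector is the one from the question.

```javascript
// Sketch: print each href from inside page.evaluate() as it is found.
// `printLinks` and its `puppeteer` parameter are illustrative names; the module
// is passed in so the function itself has no hard dependency.
async function printLinks(puppeteer, url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Forward every console message from the page to the Node console.
    page.on('console', (msg) => console.log(msg.text()));
    await page.goto(url);
    await page.evaluate(() => {
      // This callback runs in the browser, so console.log fires the 'console' event.
      document.querySelectorAll('a.question-hyperlink').forEach((item) => {
        console.log(item.getAttribute('href'));
      });
    });
  } finally {
    // Close the browser even if navigation or evaluation throws.
    await browser.close();
  }
}
```

It could be called as printLinks(require('puppeteer'), 'https://stackoverflow.com/questions/tagged/web-scraping').catch(console.error). Nothing is collected or returned; each link is printed by the console handler as the page logs it.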

Md. Abu Taher