
I am trying to extract using Puppeteer the title of this page: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106

I have the below code,

          (async () => {
            const browser = await puppet.launch({ headless: true });
            const page = await browser.newPage();
            await page.goto(req.params[0]); //this is the url
            title = await page.evaluate(() => {
              Array.from(document.querySelectorAll("meta")).filter(function (
                el
              ) {
                return (
                  (el.attributes.name !== null &&
                    el.attributes.name !== undefined &&
                    el.attributes.name.value.endsWith("title")) ||
                  (el.attributes.property !== null &&
                    el.attributes.property !== undefined &&
                    el.attributes.property.value.endsWith("title"))
                );
              })[0].attributes.content.value ||
                document.querySelector("title").innerText;
            });

I have tested this in the browser console, and even with Puppeteer's { headless: false } option. It works as expected in the browser, but when I actually run it with Node it gives me the following error:

10:54:21 AM web.1 |  (node:10288) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'attributes' of undefined
10:54:21 AM web.1 |      at __puppeteer_evaluation_script__:14:20

So, when I run the same Array.from ...querySelectorAll("meta")... query in the browser I get the expected string:

"Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom"

I'm starting to think I'm doing something wrong with the async promises, as that is the part that is different. Can anyone point me in the right direction?

EDIT: As suggested, I tested using document.title, which should be there, but it came back empty too (an empty string, not null). See code and log below:

          console.log(
            "testing the return",
            (async () => {
              const browser = await puppet.launch({ headless: true });
              const page = await browser.newPage();
              await page.goto(req.params[0]); //this is the url
              try {
                title = await page.evaluate(() => {
                  const title = document.title;
                  const isTitleThere = title == null ? false : true;
                  //recently read that this checks for undefined as well as null but not an
                  //undeclared var
                  return {
                    title: title,
                    titleTitle: title.title,
                    isTitleThere: isTitleThere,
                  };
                });
              } catch (error) {
                console.log(error, "There was an error");
              }

11:54:11 AM web.1 |  testing the return Promise { <pending> }
11:54:13 AM web.1 |  { title: '', isTitleThere: true }

Does this have to do with single-page application bs? I thought puppeteer handled that because it loads everything first.

EDIT: I have added the networkidle option and an 8000 ms wait, as suggested. The title is still empty. Code and log below:

            await page.goto(req.params[0], { waitUntil: "networkidle2" });
            await page.waitFor(8000);
            console.log("done waiting");
            title = await page.$eval("title", (el) => el.innerText);
            console.log("title: ", title);
            console.log("done retrieving");

12:36:39 PM web.1 |  done waiting
12:36:39 PM web.1 |  title:  
12:36:39 PM web.1 |  done retrieving

EDIT: PROGRESS!! Thank you to theDavidBarton. It seems headless has to be false for it to work? Does anyone know why?

Qrow Saki

3 Answers

1

When navigating to the page, wait until the page has loaded:

await page.goto(req.params[0], { waitUntil: "networkidle2" }); //this is the url

Could you try this:

try {
    title = await page.evaluate(() => {
        const title = document.title;
        const isTitleThere = title == null ? false : true
        //recently read that this checks for undefined as well as null but not an
        //undeclared var
        return {"title": title, "isTitleThere": isTitleThere}
    })
} catch (error) {
    console.log(error, 'There was an error');
}

Or this:

try {
    title = await page.evaluate(() => {
        // A DOM element cannot be serialized back to Node, so read its content attribute here
        const meta = document.querySelector('meta[property="og:title"]');
        const title = meta == null ? null : meta.content;
        const isTitleThere = title == null ? false : true
        //recently read that this checks for undefined as well as null but not an
        //undeclared var
        return {"title": title, "isTitleThere": isTitleThere}
    })
} catch (error) {
    console.log(error, 'There was an error');
}
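
A minimal sketch of a more defensive version of the asker's meta-tag lookup, assuming the goal is "first meta tag whose name or property ends in title, otherwise document.title". Both `pickTitle` and `extractTitle` are hypothetical helpers, not part of Puppeteer. The selection logic works on plain objects so that a missing tag yields null instead of a TypeError:

```javascript
// Hypothetical helper (not a Puppeteer API): choose a title from plain
// { name, property, content } records describing the page's <meta> tags.
function pickTitle(metas, documentTitle) {
  const match = metas.find(
    (m) =>
      [m.name, m.property].some(
        (v) => typeof v === "string" && v.endsWith("title")
      ) && Boolean(m.content)
  );
  // Fall back to document.title, and to null when that is empty as well.
  return (match && match.content) || documentTitle || null;
}

// Runs in Node; only the callback handed to page.evaluate runs in the page.
// It returns plain serializable data, never DOM elements.
async function extractTitle(page) {
  const { metas, documentTitle } = await page.evaluate(() => ({
    metas: Array.from(document.querySelectorAll("meta")).map((el) => ({
      name: el.getAttribute("name"),
      property: el.getAttribute("property"),
      content: el.getAttribute("content"),
    })),
    documentTitle: document.title,
  }));
  return pickTitle(metas, documentTitle);
}
```

With a real Puppeteer page this would be called as `const title = await extractTitle(page)`. A null result then means the page had neither a matching meta tag nor a non-empty title at the moment of the call.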
  • I tried the first one. It returned true :( but there's definitely a document title in the page I'm looking at. – Qrow Saki Sep 09 '20 at 18:43
  • you can access the title like so `title.title` –  Sep 09 '20 at 18:47
  • I'm not. Should I be? :0 I only want this one function to be asynchronous. It can do the rest while it's waiting, is what I thought. Is this wrong? Should I be wrapping my entire code in an async func? – Qrow Saki Sep 09 '20 at 18:59
  • Why networkidle2 specifically and not networkidle0 or 1? – Qrow Saki Sep 09 '20 at 19:22
  • 1
    I used the solution from this url [Puppeteer wait until page is completely loaded - Stack Overflow](https://stackoverflow.com/questions/52497252/puppeteer-wait-until-page-is-completely-loaded) when I ran into that problem –  Sep 09 '20 at 19:31
1

If you only need the innerText of the title element, you can use Puppeteer's page.$eval method to achieve the same result:

const title = await page.$eval('title', el => el.innerText)
console.log(title)

Output:

Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom

page.$eval(selector, pageFunction[, ...args])

The page.$eval method runs document.querySelector within the page and passes the result as the first argument to pageFunction. If no element matches the selector, the method throws an error.
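
Because page.$eval rejects when the selector matches nothing, a null-returning variant can be handy while debugging. The following is only a sketch: `textOrNull` is a hypothetical helper name, and it reads textContent rather than innerText as a simplifying choice.

```javascript
// Hypothetical helper: similar to page.$eval(selector, el => el.textContent),
// but resolves to null instead of throwing when nothing matches.
async function textOrNull(page, selector) {
  const handle = await page.$(selector); // resolves to null when there is no match
  if (!handle) return null;
  try {
    return await handle.evaluate((el) => el.textContent);
  } finally {
    await handle.dispose(); // release the element handle
  }
}
```

Usage would look like `const title = await textOrNull(page, 'title')`. An empty string then means the element exists but has not been filled in yet, while null means it is not in the DOM at all.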


However, your main problem is that the page you are visiting is a Single-Page App (SPA) built with React, and its title is filled in dynamically by the JavaScript bundle. So Puppeteer finds a valid title element in the <head>, but at that moment its content is simply "" (an empty string).

Normally, with SPAs you should use waitUntil: 'networkidle0' to make sure the JS framework has populated the DOM and the page is fully functional:

await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle0'
  })

Unfortunately, with this specific website it throws a timeout error: the network connections don't close before the default 30000 ms timeout. Something seems to be wrong on the webpage's frontend side (webworker handling?).
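
One way to keep a script going when the network never becomes idle is to treat a navigation timeout as non-fatal. This is a sketch under assumptions: `gotoTolerant` is a hypothetical helper, the 15000 ms default is arbitrary, and it assumes Puppeteer's timeout errors carry the name "TimeoutError".

```javascript
// Hypothetical helper: wait for network idle, but carry on if only the
// wait itself timed out. Any other navigation error is rethrown.
async function gotoTolerant(page, url, timeout = 15000) {
  try {
    await page.goto(url, { waitUntil: "networkidle0", timeout });
    return "idle";
  } catch (err) {
    // Assumption: Puppeteer's TimeoutError has name === "TimeoutError".
    if (err && err.name === "TimeoutError") return "timed-out";
    throw err; // DNS failures, invalid URLs, etc. still surface
  }
}
```

A "timed-out" result only says the network stayed busy. The page may still be partially rendered, so an explicit wait for the content you need should follow.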

As a workaround, you can force Puppeteer to sleep for 8 seconds with await page.waitFor(8000) before you try to retrieve the title; by that time it will be properly populated. Your script works in the DevTools console because you don't run it immediately: by the time you do, the page has fully loaded and the DOM is populated.
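
Instead of a fixed sleep, the script can poll until the title is actually non-empty. A minimal sketch using page.waitForFunction follows; `waitForTitle` is a hypothetical helper name and the 10000 ms default is an arbitrary choice.

```javascript
// Hypothetical helper: resolve with the title as soon as it is non-empty,
// or reject with a timeout error if it never gets filled in.
async function waitForTitle(page, timeout = 10000) {
  // The predicate runs inside the page, where `document` is defined.
  await page.waitForFunction(() => document.title.trim().length > 0, {
    timeout,
  });
  return page.title();
}
```

This returns as soon as the framework sets the title rather than always waiting the full 8 seconds. If the site serves a blocked or stripped-down page to the browser, the wait will time out instead, which is itself a useful signal.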

This script will return the expected title:

const puppeteer = require('puppeteer')

async function fn() {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()

  await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle2'
  })
  await page.waitFor(8000)

  const title = await page.$eval('title', el => el.innerText)
  console.log(title)

  await browser.close()
}
fn()

Launching with { headless: false } may affect the result as well.

theDavidBarton
  • It still returns empty, even with the networkidle and 8000. Is it possible it's not fully loaded even after those waits? Or am I doing something else wrong? – Qrow Saki Sep 09 '20 at 19:31
  • how do you use networkidle? if you use networkidle0 your whole script may fail. my script is only this 3 lines (after the page.goto) and it gives back the title currently. – theDavidBarton Sep 09 '20 at 19:34
  • 1
    I tried networkidle2 and networkidle0. See edit. Same result. If you say yours is getting back the title, then its probably my other parts of the code messing things up, since we have the same thing. I will get rid of those and see if it still causes a problem. Thanks for all the help! – Qrow Saki Sep 09 '20 at 19:42
  • 1
    @QrowSaki I have added my whole script at the end for clarity. I think the game changer is changing `{ headless: true }` to `{ headless: false }`. It's worth investigating why that gives different results. Glad that I could help a little. – theDavidBarton Sep 09 '20 at 19:48
  • It worked! Thank you! Do you think it matters if headless is false? I'm making a web api. I don't need the UI. If I leave the headless: false in there, will that be constantly opening up Chromium? – Qrow Saki Sep 09 '20 at 19:54
  • 1
    Yes, it is the "headful" Chrome. The problem is that this site can be automated/scraped only if the browser is not headless (at least that seems to be the limitation so far). You could try puppeteer-extra with an additional plugin called _stealth_ to pretend your Chrome is a headful instance, without launching the UI: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth if it's worth the effort for you (and the additional dependencies in your project). – theDavidBarton Sep 09 '20 at 20:01
  • Thank you! I will look into that! – Qrow Saki Sep 09 '20 at 20:09
0

Answering this as a canonical: there are a number of reasons why a script might behave differently in Puppeteer than in the browser dev tools:

  • By the time you start punching queries into dev tools, the page is usually fully loaded. This isn't necessarily the case in Puppeteer, where the concept of "page is fully loaded" is nebulous. Generally, page.waitForSelector is the solution, but sometimes more drastic measures are necessary. page.waitForTimeout is a poor solution because it causes a race condition and slows the script down unnecessarily, but it can be helpful for initial debugging before tightening the predicate.
  • In the browser dev tools, iframes and shadow roots are automatically expanded, allowing you to select things Puppeteer can't by default.
  • Servers have methods of detecting bots, preventing you from accessing the site or changing the behavior of the page in unexpected ways.
  • Some bot detection catches Puppeteer running headlessly, but not headfully. If you can't find a selector, try launching Puppeteer with puppeteer.launch({headless: false});.
  • Elements may have visibility characteristics that Puppeteer handles differently than the browser. For example, a native .click() call can work on something that's scrolled out of view or has no width and height. But Puppeteer's page.click() might be unable to click the element. page.click() issues a series of mouse commands to try to click the element in a trusted manner, as a user would. This applies to page.type and other Puppeteer API methods.
  • Pages can initiate long-running requests that cause "networkidle0" to never resolve, leading to navigation timeouts that might not cause problems in the browser.

Many of these issues can be debugged by logging console.log(await page.content()) right after your await page.goto(url, {waitUntil: "domcontentloaded"}). This can generally show you whether the site has blocked you or whether the selector simply hasn't shown up yet. If you need to search this static HTML string for your selector, Cheerio might be a useful option, although I don't recommend using it with Puppeteer in the common case.

Checking for iframes and shadow roots can be done in dev tools, but is easy to miss if you're zoomed in on a particular deeply-nested element. Walk up the parent nodes to make sure they're all normal HTML elements.
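
For the iframe case, one possible check is to search every frame of the page for the selector. This is a sketch only: `findInFrames` is a hypothetical helper, and it does not look inside shadow roots.

```javascript
// Hypothetical helper: look for a selector in the main frame and in every
// child frame, returning the first match along with the frame's URL.
async function findInFrames(page, selector) {
  for (const frame of page.frames()) {
    const handle = await frame.$(selector); // null when this frame has no match
    if (handle) return { frameUrl: frame.url(), handle };
  }
  return null; // not found in any frame (shadow roots are not searched)
}
```

If the match comes back with a `frameUrl` different from the page's own URL, the element lives in an iframe, which would explain why a plain `page.$(selector)` could not find it.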

ggorlen