
New to Node so this might be an issue of not understanding Node well enough but basically I'm trying to scrape a list of titles on a page using Puppeteer. When I run the query in Chrome console I get a list of titles. Woo!

Array.from(document.querySelectorAll('div.description h3.title')).map(partner => partner.innerText)

(12) ["Jellyfish", "MightyHive", "Adswerve", "55 | fifty-five", "E-Nor", "LiveArea", "Merkle Inc.", "Publicis Sapient", "Acceleration Precision", "Resolute Digital", "PMG", "Kepler Group"]

But when I test it out in VS Code with Node.js I get an empty array

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const url =
    "https://marketingplatform.google.com/about/partners/find-a-partner?utm_source=marketingplatform.google.com&utm_medium=et&utm_campaign=marketingplatform.google.com%2Fabout%2F";
  await page.goto(url);

  const titles = await page.evaluate(() =>
    Array.from(document.querySelectorAll("h3.title"))
      .map(partner => partner.innerText.trim())
  );
  console.log(titles);
  await browser.close();
})();

$ node google-test.js
[]

I've tried making the selector more specific, even using the inspector's 'Copy selector' shortcut for an exact match, but I still get an empty array.

If I'm more vague, such as selecting "h2", I get a result, but as soon as I get more specific it's over for me. What gives?
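(As a sanity check, not part of the original post: the map/trim step itself is fine, which you can confirm by running it against stubbed objects in plain Node. That points at the real culprit being that the elements don't exist yet when evaluate runs.)

```javascript
// Stubbed stand-ins for the DOM nodes (hypothetical data, not from the real page):
const stubbedNodes = [
  { innerText: "  Jellyfish  " },
  { innerText: "MightyHive" },
];

// The exact transformation from the question:
const titles = stubbedNodes.map(partner => partner.innerText.trim());

console.log(titles); // [ 'Jellyfish', 'MightyHive' ]
```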

SIM
gobrando
  • By the way, you don't need the utm parameters in the URL, they're just for tracking purposes. So https://marketingplatform.google.com/about/partners/find-a-partner is enough. – radulfr Nov 19 '19 at 18:58

2 Answers


The site loads its content via XHR after the initial page load, so you simply need to add the following:

await page.waitForSelector('h3.title');

This makes the script wait until an h3.title element is present in the DOM. Place it before

const titles = await page.evaluate(() =>  ...

and your code runs as-is. The full script I used:

'use strict';

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: { width: 1600, height: 1600 }
  });
  const page = await browser.newPage();

  const url =
    "https://marketingplatform.google.com/about/partners/find-a-partner";
  await page.goto(url);

  await page.waitForSelector('h3.title');  // this is the magic!

  const titles = await page.evaluate(() =>
    Array.from(document.querySelectorAll("h3.title"))
      .map(partner => partner.innerText.trim())
  );
  console.log(titles);
  await browser.close();
})();

NOTE: I have turned headless mode off and set a wider viewport so I can see what is going on. In production you don't need these settings.

QHarr
Rippo

It looks like the partner list on the page is loaded dynamically via JavaScript; in Chrome, right-click and select "View page source" to see what is actually loaded initially.

The partner list also seems to be lazily loaded on scroll, so you may need to simulate scrolling and wait for the lazy parts of the page to load in order to get the data you want.
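(If scrolling really is required, a helper along these lines is a common pattern — this is a sketch, not code from the original answer; `page` is assumed to be a Puppeteer Page, and the step size and interval are illustrative.)

```javascript
// Scrolls the page to the bottom in small steps, giving lazily loaded
// content time to appear between steps.
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise(resolve => {
      let scrolled = 0;
      const step = 400;            // pixels per tick (illustrative)
      const timer = setInterval(() => {
        window.scrollBy(0, step);
        scrolled += step;
        // Stop once we've scrolled past the full document height.
        if (scrolled >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);                     // delay so lazy content can load
    });
  });
}
```

You would call `await autoScroll(page)` after `page.goto(url)` and before evaluating the selectors.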

Haroldo_OK