0

I am trying to collect information (emails and company names) from many sites with dynamic content using Puppeteer; the links come from an array. I use a "for" loop to iterate over the array of links, call page.goto for each site, wait until the site has loaded, wait several more seconds for the dynamic content, and then start reading the data. However, only the first and last requests complete (their promises resolve). The other promises do not return the dynamic content. What should I do to fix that? Thanks

let puppeteer = require('puppeteer');

(async() => {
const browser = await puppeteer.launch();
let page = await browser.newPage();
const url = 'https://abcdsite.com/';
let arrayNames = ['first','second','third','abcd'];
for(let i=0;i<await arrayNames.length;){
    let nameUrl = await arrayNames[i];
    if (i<4){
      let temp1;
      console.log(`begin for ${nameUrl}`);
      await page.goto(`${url}${nameUrl}`, { waitUntil: 'load' })
          .then(()=>{
            return new Promise(res=>{
              //wait content dynamic load
              setTimeout(()=>{
                temp1 = page.evaluate(() => {
                  return new Promise(resolve => { // <-- return the data to node.js from browser
                    let name = document.querySelector('h1').innerHTML;
                    let email = document.getElementsByClassName('sidebar-views-contacts h-card vcard')[0]
                        .children[2].children[0].children[0].innerHTML;
                    resolve(email);
                  });
                });
                res(temp1);
              },7000);

            })
      })
          .then((res)=>{
            i++;
            console.log(`https://abcdsite.com/${nameUrl}`,temp1);
          });
    }
    else{
      break
    }
  }
})();
diesel94

2 Answers

2

I think this will help you.

1) Make an async function that requests and parses your data.

2) Create an array of parallel tasks.

let puppeteer = require('puppeteer');

// Resolve after `ms` milliseconds; gives the dynamic content time to load.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function makeRequest(page, url, nameUrl) {
  await page.goto(`${url}${nameUrl}`, { waitUntil: 'load' });

  // Wait for the dynamic content before reading the DOM.
  await delay(7000);

  // Runs in the browser; the returned string is sent back to Node.js.
  return page.evaluate(() => {
    let email = document.getElementsByClassName('sidebar-views-contacts h-card vcard')[0]
      .children[2].children[0].children[0].innerHTML;

    return email;
  });
}

(async () => {
  const browser = await puppeteer.launch();
  const url = 'https://abcdsite.com/';
  let arrayNames = ['first', 'second', 'third', 'abcd'];

  // One page per task: a single page can only navigate to one URL at a time.
  let tasks = [];
  for (let i = 0; i < arrayNames.length; i++) {
    const page = await browser.newPage();
    tasks.push(makeRequest(page, url, arrayNames[i]));
  }

  const res = await Promise.all(tasks);
  for (let i = 0; i < arrayNames.length; i++) {
    console.log(`https://abcdsite.com/${arrayNames[i]}`, res[i]);
  }

  await browser.close();
})();

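Series solution

To fetch the pages one after another instead, await each request inside the loop:

for (let i = 0; i < arrayNames.length; i++) {
  // `page` is a single page from browser.newPage(), reused for every URL
  let temp = await makeRequest(page, url, arrayNames[i]);
  console.log(`https://abcdsite.com/${arrayNames[i]}`, temp);
}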
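
For a large list, such as the 1600 pages mentioned in the comments, a middle ground between fully parallel and fully serial is to work in fixed-size batches. The following is a minimal sketch of that idea and not part of the original answer: `chunk`, `runInBatches`, `fetchAll`, and the batch size of 5 are hypothetical names and values, and `fetchAll` assumes a `makeRequest(page, url, nameUrl)` function like the one above.

```javascript
// Split `items` into consecutive groups of at most `size` elements.
function chunk(items, size) {
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Run `worker(item)` for every item, at most `batchSize` at a time.
// Results come back in the same order as `items`.
async function runInBatches(items, batchSize, worker) {
  const results = [];
  for (const group of chunk(items, batchSize)) {
    // Each batch finishes completely before the next one starts.
    const batchResults = await Promise.all(group.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}

// Hypothetical wiring: fetch the emails five pages at a time, assuming
// makeRequest(page, url, nameUrl) from the answer and an open `browser`.
async function fetchAll(browser, url, arrayNames) {
  return runInBatches(arrayNames, 5, async (nameUrl) => {
    const page = await browser.newPage(); // one page per in-flight request
    try {
      return await makeRequest(page, url, nameUrl);
    } finally {
      await page.close(); // close the tab once this request is done
    }
  });
}
```

Calling `await fetchAll(browser, url, arrayNames)` inside the main async function would return the emails in the same order as `arrayNames`. A smaller batch size is slower but keeps fewer tabs open at once.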
Saeed
  • It gives me ```Possible EventEmitter memory leak detected. 11 Symbol(Events.FrameManager.FrameDetached) listeners added to [FrameManager]. Use emitter.setMaxListeners() to increase limit``` – diesel94 Apr 24 '20 at 12:15
  • How many pages are you trying to fetch? Check [this](https://stackoverflow.com/questions/9768444/possible-eventemitter-memory-leak-detected) question and its answers. @diesel94 – Saeed Apr 24 '20 at 12:19
  • I am trying to load 1600 pages. The answers there talk about SlimerJS. I tried adding ```require('events').EventEmitter.prototype._maxListeners = 1700;``` but without success( – diesel94 Apr 24 '20 at 12:38
  • 1600 pages! That's a lot. You should use another approach. Is it important to get them as fast as possible? You can do it in series, or you can fetch N pages in each task. @diesel94 – Saeed Apr 24 '20 at 12:44
  • No, I can wait for an hour if needed) I realize that it is hard to do this fast, so I just want it to work in any way – diesel94 Apr 24 '20 at 12:48
  • Hey, I tried it for 10 items and it works, **but** ```res[i]``` is ```undefined``` in the console – diesel94 Apr 24 '20 at 12:51
  • There was a missing `await` in the **makeRequest** function. Copy it again. @diesel94 – Saeed Apr 24 '20 at 12:54
  • It worked without async, but ```res[i]``` is the same for every item: the last email in the list that was logged. With async it is ```...``` for every item, which means the emails had not yet been loaded dynamically on the sites when we read them – diesel94 Apr 24 '20 at 13:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212418/discussion-between-saeed-and-diesel94). – Saeed Apr 24 '20 at 13:09
1

Puppeteer's page.goto function has several options you can use to ensure that the page is fully loaded; see the documentation for page.goto. In addition, you can use the page.waitFor method to wait for a few seconds; see the documentation for page.waitFor.

Here is a simple example that I think may work for you:

const puppeteer = require('puppeteer')

const url = 'https://stackoverflow.com/'
const arrayNames = ['tags', 'users', 'jobs', 'questions'];

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()

  const data = {}
  for (const nameUrl of arrayNames) {
    const fullUrl = `${url}${nameUrl}`
    console.log(`begin for ${fullUrl}`)
    await page.goto(fullUrl, { waitUntil: 'networkidle0' }) // check networkidle0 parameter and others here: https://pptr.dev/#?product=Puppeteer&version=v2.1.1&show=api-pagegotourl-options
    await page.waitFor(2000) // wait 2 seconds to allow a full load. Optional
    const pageData = await page.evaluate(() => {
      const name = document.querySelector('h1').innerText
      const pageTitle = document.querySelector('title').innerText
      // get whatever data you need to get from the page.
      return { name: name, title: pageTitle }
    })
    console.log('\t Data from page: ', pageData)
    data[fullUrl] = pageData
  }
  console.log(data)
})()

This does not run all the sites in parallel, but you can play around with the example from here. Instead of awaiting each page.evaluate call in turn, you could open one page per URL, collect all the promises in an array, and then use await Promise.all(listOfPromises).
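
If a fixed delay is sometimes too short, one option is to wait for the content itself rather than for a set number of seconds. The sketch below is an illustration under stated assumptions and has not been tested against the real site: it uses Puppeteer's `page.waitForFunction` to poll until the email element holds something other than the placeholder. `readEmail`, `scrapeEmail`, the `'⋯'` placeholder value, and the 30-second timeout are assumptions; the class name and element path are copied from the question.

```javascript
// Class name taken from the question; the placeholder is an assumption based
// on the '⋯' stub reported in the comments.
const CONTACTS_CLASS = 'sidebar-views-contacts h-card vcard';
const PLACEHOLDER = '⋯';

// Runs inside the browser. Returns the email text once it has been filled in,
// or false while the element is missing, empty, or still shows the placeholder.
function readEmail(className, placeholder) {
  const block = document.getElementsByClassName(className)[0];
  const node = block && block.children[2] && block.children[2].children[0]
    && block.children[2].children[0].children[0];
  const text = node ? node.innerText.trim() : '';
  return text !== '' && text !== placeholder ? text : false;
}

// Navigate to `fullUrl` and resolve with the email once it has loaded.
async function scrapeEmail(page, fullUrl) {
  await page.goto(fullUrl, { waitUntil: 'networkidle0' });
  // Poll readEmail in the page until it returns a truthy value or times out.
  const handle = await page.waitForFunction(
    readEmail, { timeout: 30000 }, CONTACTS_CLASS, PLACEHOLDER);
  return handle.jsonValue();
}
```

`page.waitForFunction` rejects with a timeout error if the value never appears, so callers may want to wrap `scrapeEmail` in `try`/`catch` and record the URL for a retry.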

charly rl
  • I tried it this way, but it doesn't return all the data, only the data from sites where it arrived in time. For the other sites the dynamic data did not arrive in time, and I get only the stub '...' because the data has not loaded on the site yet. I get something like this: ```abcdsite.com/spurit ⋯ abcdsite.com/senticode ⋯ abcdsite.com/instinctools ⋯ abcdsite.com/codemaster info@codemaster.by abcdsite.com/dewpoint ⋯ abcdsite.com/01d ⋯ abcdsite.com/12devs ⋯ abcdsite.com/onepoint ⋯ ``` – diesel94 Apr 24 '20 at 19:25
  • ```tempPage.waitFor(7000);``` does not work for me at all ( – diesel94 Apr 24 '20 at 20:13
  • I have checked it with {headless: false} – diesel94 Apr 24 '20 at 20:27
  • Hi @diesel94, I am not sure I understand your issue. Maybe you can provide an example URL where this can be reproduced. Maybe the JavaScript you are running inside the page.evaluate function isn't working on all pages. The provided example should work on any valid page. – charly rl Apr 24 '20 at 20:48
  • It works on all pages, but for a different share each time) Sometimes for 20%, sometimes for 50% or 90%. Example: dev.by/onepoint; dev.by/spurit etc. – diesel94 Apr 24 '20 at 21:08
  • Sorry, still not sure what is happening in your case. If I use the code above, and replace the URLs with the ones you provided, it still works for me. `const url = 'https://dev.by/';` `const arrayNames = ['news', 'spurit', 'onepoint', 'questions'];` This is what I get: `{'https://dev.by/news':{name: 'Новости', title: 'Новости | dev.by' },'https://dev.by/spurit':{name: 'dev.by', title: '\n ИТ в Беларуси | dev.by\n' },'https://dev.by/onepoint':{name: 'dev.by', title: '\n ИТ в Беларуси | dev.by\n' },'https://dev.by/questions':{name: 'dev.by', title: '\n ИТ в Беларуси | dev.by\n' } }` – charly rl Apr 24 '20 at 21:39
  • You have parsed static content. Look at the email and company name, read through the ```email``` variable in my code in the question. That is dynamic content, which comes out as “...” with your approach. – diesel94 Apr 25 '20 at 06:27
  • Hi. Actually, I didn't test the email part because the urls you provided are returning a 404 not found page. The only one that works is https://dev.by/news and the selector `document.getElementsByClassName('sidebar-views-contacts h-card vcard')[0]` doesn't return anything. Maybe you are testing with other pages... ? – charly rl Apr 26 '20 at 08:02
  • I am sorry! The correct URLs use the “companies” prefix, like ```https://companies.dev.by/spurit```. Please check it now – diesel94 Apr 26 '20 at 14:26
  • Hi again. I tried the code with these URLs: `const url = 'https://companies.dev.by/' const arrayNames = ['spurit', 'senticode', 'instinctools', 'codemaster', 'dewpoint', '01d', '12devs', 'onepoint'];` and it works just fine. I don't know if you have modified the code, but for me it works as it is. I also added the email `const email = document.getElementsByClassName('sidebar-views-contacts h-card vcard')[0].children[2].children[0].children[0].innerText; return { name: name, title: pageTitle, email: email }` – charly rl Apr 27 '20 at 10:59
  • Hmm, the difference between your code and mine is the “for” loop. I use a ```for ( , , )``` loop. I will try exactly your code above tomorrow – diesel94 Apr 28 '20 at 05:53
  • Man! Your method is perfect! I changed the ```for(let i =0 ...``` loop to ```for(const url of array)``` and everything works perfectly. I don't know what the difference between those methods is, but thank you very much – diesel94 Apr 29 '20 at 11:26
  • Oh, I tried it for 10 items and it works fine, but when I try it for 20+ items it again does not have time to capture the data( – diesel94 Apr 30 '20 at 08:13
  • I copy/pasted your code and in the console I get ```begin for spurit Data from page: { name: '"SpurIT"', email: 'contact@spur-i-t.com' } begin for senticode Data from page: { name: '"Сэнтикод"', email: '⋯' } begin for instinctools Data from page: { name: '*instinctools', email: '⋯' } begin for codemaster Data from page: { name: '//CODEMASTER', email: 'info@codemaster.by' } begin for dewpoint Data from page: { name: '°dewpoint ', email: '⋯' } begin for 01d Data from page: { name: '01D', email: '⋯' } ``` – diesel94 Apr 30 '20 at 08:31