
I am using Puppeteer and Node.js to scrape some data, but I run into a problem when looping over a list of URLs. When I push the scraped data, I get an error saying that the array is not defined. I think the problem has to do with using await inside the for loop, but I don't really understand why or how to fix it. Why does it say that the array is not defined?

Here is a very simplified version of my code:

const scrapeJobInfo = async (links) => {
  
  /* Initiate the Puppeteer browser */
  const browser = await puppeteer.launch(); 
  const page = await browser.newPage();

  /* Empty array for pushing the data */
  const jobData = [];

  /* For loop that pushes the data */
  for(let i = 0; i < links.links.length; i++ ) {
    let linkUrl = `${links.links[i]}`
    await page.goto(linkUrl, { waitUntil: 'networkidle0' });
    let companyInfo = await page.evaluate(() => {
      jobData.push('hi') //<-- ReferenceError: jobData is not defined 
    });
  } 

  /* Close browser and log jobData */
  await browser.close();
  console.log(jobData)
};
Hejhejhej123
  • What does your client code look like, or how do you call this method? Also, could you please provide the specific error message? – Taha Yavuz Bodur Jul 26 '20 at 18:42
  • Hi, right now I don't have any client code. I am doing everything in Node and triggering the code by calling the API path with Postman. The API triggers code that scrapes the URLs, then calls this function and passes the URLs as an object. The object looks like this: { links: [ 'https://webpage/ad/123', 'https://webpage/ad/124' ] } – Hejhejhej123 Jul 26 '20 at 18:52
  • the error message is "ReferenceError: jobData is not defined" – Hejhejhej123 Jul 26 '20 at 18:53
  • The code within `page.evaluate` is evaluated (as the name says) in a different context, so anything within that function does not have access to any outer variables. AFAIK the only option is to return the data you want to retrieve from the evaluation step, and then process it outside of the `evaluate` callback. (You can pass variables to `evaluate` as arguments, but I haven't used puppeteer for a while so I don't know if those are copies, but I would guess they are.) – t.niese Jul 26 '20 at 19:04
  • No, this has nothing to do with `await` or the loop. It's just puppeteer not being able to deal with closures: the function code is stringified and injected in the page. It cannot have side effects on variables in the node.js program. – Bergi Jul 26 '20 at 19:40
  • @Bergi I also saw the question you linked as duplicate, but I didn't mark it as duplicate, because it is about passing variables to the `evaluate` function and not about passing data back from the `evaluate` function, and from what I recall the variables that are passed are copies. So `jobData.push('hi')` won't work. So while it would solve the `ReferenceError: jobData is not defined` error, it would not solve the actual problem the OP has. – t.niese Jul 27 '20 at 05:42
  • @t.niese I found another one that's closer but not as popular – Bergi Jul 27 '20 at 09:49
  • I managed to solve this. What I did was return the "hi" from the page.evaluate callback. Then I pushed the returned value, like this: jobData.push(companyInfo). My question seems to be locked, so I can't answer it myself, but this was my solution in case someone has the same issue. – Hejhejhej123 Jul 27 '20 at 20:40
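
The fix described in the last comment can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual code: the helper name `collectJobData` is made up, and it takes an already opened Puppeteer `page` plus a plain array of URLs as parameters (the original opens the browser itself and reads `links.links`). The point is that the `evaluate` callback returns a value, and the push happens on the Node.js side.

```javascript
// Hypothetical helper (name and signature are assumptions, not from the question).
// `page` is expected to be a Puppeteer Page, `urls` an array of URL strings.
async function collectJobData(page, urls) {
  const jobData = [];

  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle0' });

    // The callback runs inside the page, so it cannot see `jobData`.
    // It returns a serializable value instead of pushing to the array.
    const companyInfo = await page.evaluate(() => {
      return 'hi';
    });

    // Back in Node.js, the returned value can be pushed as usual.
    jobData.push(companyInfo);
  }

  return jobData;
}
```

With the object from the question, this could be called as `await collectJobData(page, links.links)` after `page` has been created with `browser.newPage()`.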

1 Answer

let companyInfo = await page.evaluate((jobData) => {
  jobData.push('hi') //<-- jobData is now defined inside the callback
}, jobData);

Can you please try this? The documentation makes me think this will work for you.

Please have a look here for more information: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-pageevaluatepagefunction-args

nu_popli
  • The error disappeared, but it returned an empty array, no "hi" inside it. – Hejhejhej123 Jul 26 '20 at 19:41
  • The explanation by Bergi must be the answer to this: "the function code is stringified and injected in the page. It cannot have side effects on variables in the node.js program". Maybe try jobData.push(your_fxn()); and return 'hi' from within the function. – nu_popli Jul 27 '20 at 04:44
  • `Can you please try this? Documentation has makes me think this will work for you.` that's something you should do before answering. – t.niese Jul 27 '20 at 05:19
  • @t.niese sorry for the confusion, but I did, and it worked (with the same issue he mentioned in the comment). The only reason I said "Can you please try this?" is that I don't have the complete context of his program and was unsure why I wasn't getting the 'hi': was it due to the incomplete program or something else? That's the only reason I asked him to try. But I get your point. Will be more clear next time. Thanks. – nu_popli Jul 27 '20 at 05:26
  • @nu_popli well then it didn't work, because the desired behavior is that `hi` is added to `const jobData = [];`, and that does not happen. – t.niese Jul 27 '20 at 05:28
  • Yes I understand. As I mentioned, I was unsure whether it was due to puppeteer's behaviour or due to incomplete context. And as I said, got your point. – nu_popli Jul 27 '20 at 05:30
  • If you test it yourself, you can create a [mcve] by changing the code until it works, as long as `const jobData = [];` and the check whether `jobData` is populated correctly are not within the callback passed to `evaluate`. So the argument that you don't know the full context of the question does not really count. `I was unsure whether it was due to puppeteer's behaviour`: understanding the behavior is an important part of giving a correct and useful answer; otherwise, you might just give dangerous advice. – t.niese Jul 27 '20 at 05:39
  • Yes mate I have understood your point. Will take care of this moving forward. – nu_popli Jul 27 '20 at 05:42
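
To illustrate why the array stayed empty in the answer above: as the comments point out, the argument handed to `page.evaluate` reaches the page as a serialized copy, so pushing to it does not touch the array in Node.js. If the array really has to be modified inside the page, one option is to return the copy and use the returned value. The helper name `pushInPage` below is hypothetical, and this is a sketch of the idea rather than a recommended pattern.

```javascript
// Hypothetical helper (not from the thread). `page` is expected to be a
// Puppeteer Page and `jobData` a JSON-serializable array.
async function pushInPage(page, jobData) {
  const updated = await page.evaluate((items) => {
    // `items` is a copy that lives in the page, not the Node.js array.
    items.push('hi');
    // Returning it is the only way the change gets back to Node.js.
    return items;
  }, jobData);

  // `jobData` itself is unchanged here; `updated` holds the new contents.
  return updated;
}
```

A caller would then write something like `jobData = await pushInPage(page, jobData)`, which requires `jobData` to be declared with `let` instead of `const`.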