I'm trying to get the description after clicking on each job listing. The code below seems to work, but it repeats the first few job descriptions and omits the rest. I tried adding a delay, because I thought timing could be the cause, but it still doesn't work.

const puppeteer = require("puppeteer");
const fs = require("fs/promises");

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: {width: 1280, height: 800},
  });

  const page = await browser.newPage();
  await page.goto(
    `https://www.indeed.com/jobs?q=data+analyst&l=New+York%2C+NY&vjk=fb43dfe81685438a`
  );
  const jobs = await page.$$(".jobsearch-ResultsList > li");

  for (const job of jobs) {
    try {
      await job.click();
      await page.waitForSelector("#jobDescriptionText");
      // await page.click('.resultContent');
      const job_description = await page.$eval(
        "#jobDescriptionText",
        el => el.textContent
      );
      await fs.appendFile("jobs", job_description);
    } catch (error) {}
  }
  await browser.close();
})();
  • Please share a [mcve] with the page you're working with. It's hard to help otherwise. Thanks. As for this code, you loop `const job of jobs` but never use `job` anywhere in the loop. – ggorlen Apr 03 '23 at 16:09
  • @ggorlen thanks, I have worked a bit on that, and now I'm using the job variable, which makes sense. I am using the Indeed website, but I still have an issue with my code – niknoy nori Apr 05 '23 at 12:37
  • What issue do you have with your code? What information are you trying to get ultimately? – ggorlen Apr 05 '23 at 15:22
  • I'm trying to open each job on the first page and scrape its job description. This is just an exercise for me to learn Puppeteer. For now, the code extracts some job descriptions twice, and half of the listings are ignored – niknoy nori Apr 06 '23 at 07:28

1 Answer

I'm seeing varying behavior here depending on whether I run headlessly or not. If I run headfully, clicking each link seems to pop open a new page, which I can capture and then pull the job's description from:

const puppeteer = require("puppeteer"); // ^19.7.5

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector("a.jcs-JobTitle");
  const descriptions = [];

  for (const job of await page.$$("a.jcs-JobTitle")) {
    await job.click();
    const newTarget = await browser.waitForTarget(target =>
      target.opener() === page.target()
    );
    const newPage = await newTarget.page();
    const el = await newPage.waitForSelector("#jobDescriptionText");
    descriptions.push(await el.evaluate(e => e.textContent.trim()));
    await newPage.close();
  }

  console.log(descriptions);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

But when I run headlessly, the selector doesn't seem to show up.

When I run Firefox manually, I see that the description appears in a sidebar that doesn't involve a navigation, so there's clearly some variable behavior at hand.
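
For the sidebar case, where clicking a listing swaps the description in place rather than navigating, one possible approach is to wait until the text in the description element actually changes before reading it. Otherwise `waitForSelector` can resolve immediately against the previous listing's description, which may be why the question's code repeats entries (this is an assumption, not something verified here). A minimal sketch, where `scrapeSidebar` is a hypothetical helper name and both selectors are supplied by the caller:

```javascript
// Hypothetical helper: click each listing and wait until the sidebar text
// differs from the previously scraped description before reading it.
async function scrapeSidebar(page, listingSelector, descriptionSelector) {
  const descriptions = [];
  let previous = "";

  for (const listing of await page.$$(listingSelector)) {
    await listing.click();

    // Runs in the browser: resolves once the element holds non-empty text
    // that differs from the previous description; rejects on timeout.
    await page.waitForFunction(
      (selector, prev) => {
        const text = document.querySelector(selector)?.textContent.trim();
        return Boolean(text) && text !== prev;
      },
      {timeout: 10000},
      descriptionSelector,
      previous
    );

    previous = await page.$eval(
      descriptionSelector,
      el => el.textContent.trim()
    );
    descriptions.push(previous);
  }

  return descriptions;
}
```

It could be called as `await scrapeSidebar(page, "a.jcs-JobTitle", "#jobDescriptionText")`. Two consecutive listings with identical descriptions would make the wait time out, so treat this as a starting point rather than a finished solution.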

One workaround that should handle both cases is to grab the URL from the link for each description, then use page.goto() to navigate to it, hopefully bypassing the click behavior differences:

// ...
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector("a.jcs-JobTitle");
  const descriptions = [];
  const hrefs = await page.$$eval(
    "a.jcs-JobTitle",
    els => els.map(e => e.href)
  );

  for (const href of hrefs) {
    await page.goto(href, {waitUntil: "domcontentloaded"});
    const el = await page.waitForSelector("#jobDescriptionText");
    descriptions.push(await el.evaluate(e => e.textContent.trim()));
  }

  console.log(descriptions);
// ...

Note that I'm setting a user agent header to reduce the chance of detection in headless mode.
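
The snippets above only log the results, whereas the question appends each description to a file with no delimiter between entries. If the goal is to save them, one option is to write everything in a single call after the loop, one JSON object per line. This is only a sketch; `toJsonLines`, `saveDescriptions`, and the output file name are hypothetical:

```javascript
const fs = require("fs/promises");

// Hypothetical helper: one JSON object per line keeps descriptions separable,
// unlike appending raw text with no delimiter.
function toJsonLines(descriptions) {
  return descriptions
    .map((description, index) => JSON.stringify({index, description}) + "\n")
    .join("");
}

// Write all descriptions in one call once scraping has finished.
async function saveDescriptions(file, descriptions) {
  await fs.writeFile(file, toJsonLines(descriptions), "utf8");
}
```

After the loop, this could be used as `await saveDescriptions("jobs.jsonl", descriptions);` in place of `console.log(descriptions)`.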

ggorlen