
I am scraping LinkedIn's jobs section with cheerio, for example the following link:

https://www.linkedin.com/jobs/search/?f_TPR=r86400&geoId=105080838&keywords=Full%20Stack&location=New%20York%2C%20United%20States

If I browse the link using Chrome, it splits the jobs into pages, but when I browse it in Microsoft Edge (feel free to try it yourself), it only loads more jobs when I scroll to the bottom of the page. My assumption is that cheerio is using Microsoft Edge behind the scenes, but I am not sure about that, and I don't know how to change it or whether that would even be a good idea.

I would like to ask what my options are in this situation when I try to scrape all of the jobs, including those that are dynamically rendered or those that are on another page.

The code that gives me what I currently have is:

    const axios = require('axios');
    const cheerio = require('cheerio');

    const jobsArr = [];
    const LINKEDIN_JOBS_OBJ = await axios.get(
        'https://www.linkedin.com/jobs/search/........');

    const $ = cheerio.load(LINKEDIN_JOBS_OBJ.data);
    const listItems = $('li div a');
    listItems.each(function (idx, el) {
        jobsArr.push($(el).text().replace(/\n/g, '').replace(/\s\s+/g, ' '));
    });

Which gives me only the jobs on the first page / in the first section.

1 Answer


Cheerio is not using any browser behind the scenes. It just parses HTML text into DOM objects that you can inspect, but it won't execute JavaScript to load dynamic content.

If you want to load data from other pages, you'll need to use cheerio to find the <a> tags on the page, then use axios to send requests to those URLs, and then use cheerio again to parse the results of those requests.
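As a minimal sketch of that crawl loop: this assumes (and it is only an assumption — check the requests your browser makes while paging) that the search results paginate via a `start` query parameter with 25 jobs per page. `buildPageUrl` and `scrapeAllPages` are illustrative helper names, not part of any library:

```javascript
// Build the URL for a given result offset. The `start` parameter name
// and the page size of 25 are assumptions -- verify them in the
// network tab of your browser's dev tools.
function buildPageUrl(baseUrl, start) {
    const u = new URL(baseUrl);
    u.searchParams.set('start', String(start));
    return u.toString();
}

async function scrapeAllPages(baseUrl, maxPages) {
    // axios and cheerio are loaded lazily so the pure URL helper above
    // can be used without the network dependencies installed
    const axios = require('axios');
    const cheerio = require('cheerio');
    const jobsArr = [];
    for (let page = 0; page < maxPages; page++) {
        const res = await axios.get(buildPageUrl(baseUrl, page * 25));
        const $ = cheerio.load(res.data);
        const listItems = $('li div a');
        if (listItems.length === 0) break; // no more results, stop early
        listItems.each(function (idx, el) {
            jobsArr.push($(el).text().replace(/\n/g, '').replace(/\s\s+/g, ' '));
        });
    }
    return jobsArr;
}
```

Be aware that sites like LinkedIn rate-limit and block scrapers aggressively, so add delays between requests and expect non-200 responses.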

If you want to parse dynamically rendered content, you'll need a tool that can load pages and run JavaScript like a browser does. Selenium is the classic choice, but you might prefer Puppeteer as a modern alternative. https://github.com/puppeteer/puppeteer
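Here's a sketch of the Puppeteer approach, assuming you've run `npm install puppeteer`. The `li div a` selector is carried over from your cheerio code, and the scroll loop is a generic infinite-scroll pattern, not anything LinkedIn-specific:

```javascript
// Clean up whitespace the same way your cheerio version does
function cleanText(s) {
    return s.replace(/\n/g, '').replace(/\s\s+/g, ' ');
}

async function scrapeWithPuppeteer(url) {
    // puppeteer is loaded lazily; requires `npm install puppeteer`
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Generic infinite-scroll pattern: scroll to the bottom, wait for
    // new content, and repeat until the page height stops growing
    let prevHeight = 0;
    while (true) {
        const height = await page.evaluate(() => document.body.scrollHeight);
        if (height === prevHeight) break;
        prevHeight = height;
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await new Promise((r) => setTimeout(r, 2000)); // let new jobs load
    }

    const jobs = await page.$$eval('li div a', (els) =>
        els.map((el) => el.textContent)
    );
    await browser.close();
    return jobs.map(cleanText);
}
```

Since Puppeteer runs the page's JavaScript, this picks up the dynamically loaded jobs that a plain axios + cheerio fetch never sees.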

Raphael Serota