
Apify can crawl links from a sitemap.xml, for example:

const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});

https://sdk.apify.com/docs/examples/puppeteersitemap#docsNav

However, I am not sure how to crawl links from sitemap.xml when I am using a requestQueue. For example:

const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({ url: 'https://google.com' });

// This is not working. Apify simply crawls sitemap.xml
// and does not add the urls from sitemap.xml to the requestQueue.
await requestQueue.addRequest({ url: 'https://google.com/sitemap.xml' });

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,

    // This function is called for every page the crawler visits
    handlePageFunction: async (context) => {
        const { request, page } = context;

        const title = await page.title();
        const pageUrl = request.url;
        console.log(`Title of ${pageUrl}: ${title}`);

        await Apify.utils.enqueueLinks({
            page, selector: 'a', pseudoUrls, requestQueue,
        });
    },
});

await crawler.run();
Ben W

1 Answer


The great thing about Apify is that you can use both RequestList and RequestQueue together. In that case, items are taken from the list to the queue as you scrape (so the queue is not overloaded). By using both, you get the best of both worlds.

const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();

    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);

            // This is just an example, define your logic
            await Apify.utils.enqueueLinks({
                page, selector: 'a', pseudoUrls: null, requestQueue,
            });
            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});

If you want to use just the queue, you will need to parse the XML yourself. Of course, this is not a big issue. You can parse it easily with Cheerio, either before the crawler starts or by using Apify.CheerioCrawler.
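
For completeness, here is a rough sketch of that queue-only approach (not from the original post): download the sitemap, parse its <loc> elements with Cheerio, and add each URL to the RequestQueue before starting the crawler. It assumes the cheerio package is installed alongside apify and that Apify.utils.requestAsBrowser is available in your SDK version.

const Apify = require('apify');
const cheerio = require('cheerio');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    // Download the sitemap and parse its <loc> elements.
    const { body } = await Apify.utils.requestAsBrowser({
        url: 'https://edition.cnn.com/sitemaps/cnn/news.xml',
    });
    const $ = cheerio.load(body, { xmlMode: true });

    // Add every URL listed in the sitemap to the queue.
    for (const el of $('loc').toArray()) {
        await requestQueue.addRequest({ url: $(el).text().trim() });
    }

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({ url: request.url, title: await page.title() });
        },
    });

    await crawler.run();
});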

In any case, we recommend using RequestList for bulk URLs because it is created almost instantly in memory, whereas the queue is actually a database (or JSON files when running locally).

Lukáš Křivka
  • Thanks. I was trying to get all the urls from sitemap.xml. I did not realize that the `requestList.initialize()` method can download sitemap.xml and parse the urls from it. – Ben W Aug 21 '19 at 19:01
  • Yeah, the magic is in the `requestsFromUrl`. It uses a regex to scan the URL (it can be HTML, CSV, XML, TXT, JSON etc.) for URLs and creates a request list from that. – Lukáš Křivka Aug 23 '19 at 18:27
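
For illustration only (the file URL below is hypothetical), a minimal sketch of `requestsFromUrl` pointed at a plain-text file of URLs:

const requestList = new Apify.RequestList({
    sources: [
        // The SDK downloads this file and extracts every URL it finds,
        // whether the file is TXT, CSV, XML, HTML or JSON.
        { requestsFromUrl: 'https://example.com/urls.txt' },
    ],
});
await requestList.initialize();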