
Please do note that I am a novice when it comes to web technologies. I have to crawl and scrape quite a few websites that are built with a mix of React / JavaScript / HTML. In all, these sites have approximately 0.1 to 0.5 million pages.

I am planning to use Selenium and Scrapy to get the crawling and scraping done. Scrapy alone cannot scrape React pages, and using Selenium to scrape regular JavaScript/HTML pages can prove very time-consuming.

I want to know if there is any way my crawler/scraper can tell a React page apart from a plain JavaScript/HTML page.

Awaiting response.

Sree Nair
  • Try to request the JSON URLs directly in the case of React pages, to avoid using Selenium. – Moein Kameli Dec 04 '19 at 13:01
  • Does this [discussion](https://stackoverflow.com/questions/54985385/can-i-add-the-id-property-to-an-html-element-created-with-react/54997968#54997968) help you? – undetected Selenium Dec 04 '19 at 13:38
  • Do you plan to write a targeted spider for each site, or perform a broad crawl that only extracts generic information from every page? – Gallaecio Dec 04 '19 at 13:47
  • @Piron - I would have to use a headless browser library in Node or Python nonetheless. I don't see any other way around this. – Sree Nair Dec 06 '19 at 11:01
  • DebanjanB - The discussions referenced have not provided a resolution yet. @Gallaecio - The idea is to download all the links available (presumably the hrefs from anchor, area, base tags, etc.) in a website recursively. The code needs to be written in such a way that each page is passed to either my Node crawler or my Python crawler dynamically, based on whether the page is React-based or not. – Sree Nair Dec 06 '19 at 11:10
  • Any updates here, anyone? – Sree Nair Dec 10 '19 at 06:14

1 Answer


Not sure if this has come too late, but I'll write my two cents on this issue nonetheless.

> I have to crawl and scrape quite a few websites that are built with a mix of React / JavaScript / HTML.

Correct me if I'm wrong, but I believe what you meant is that some webpages of those sites contain the data of interest (the data to be scraped) already present in the plain HTML, without involving JS. Hence, you wish to separate the webpages that need JS rendering from those that don't, to improve scraping efficiency.

To answer your question directly: there is no smart system a crawler can use to differentiate between those two types of webpages without rendering them at least once.

If the URLs of the webpages follow a pattern that enables you to easily discern which pages use JS and which only require plain HTML crawling:

You can try to render the page at least once and write conditional code around the response. What I mean is: first crawl the target URL with Scrapy (plain HTML fetch, no JS rendering), and if the response received is incomplete (assuming the incomplete response is not due to erroneous element-selection code), crawl it a second time with a JS renderer.
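Below is a minimal sketch of that idea, not a drop-in solution: the selector `div.product` and the `js_rendered` meta flag are placeholders, and the second pass assumes you have wired up some JS-rendering downloader middleware (e.g. one based on Selenium or Pyppeteer) that reacts to the flag.

```python
# Sketch: parse the plain-HTML response first; if the (hypothetical) data
# selector comes back empty, re-queue the same URL flagged for JS rendering.
import scrapy


class HybridSpider(scrapy.Spider):
    name = "hybrid"
    start_urls = ["https://example.com/"]  # placeholder

    def parse(self, response):
        # Hypothetical selector for the data of interest.
        items = response.css("div.product::text").getall()

        if items:
            # The data was already present in the static HTML.
            for text in items:
                yield {"url": response.url, "text": text}
        elif not response.meta.get("js_rendered"):
            # Incomplete response: retry once, flagged for a JS-rendering
            # downloader middleware (not shown here) to pick up.
            yield scrapy.Request(
                response.url,
                callback=self.parse,
                meta={"js_rendered": True},
                dont_filter=True,  # allow re-crawling the same URL
            )
```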


This brings me to my second point: if the webpages have no fixed URL pattern, you can simply use a faster and more lightweight JS renderer for everything.

Selenium indeed has relatively high overhead for mass crawling (up to 0.5 million pages in your case), since it was not built for that in the first place. You can check out Pyppeteer, an unofficial Python port of Puppeteer, Google's Node.js browser-automation library. This will allow you to integrate it with Scrapy more easily.
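As a rough illustration of what the rendering step looks like with Pyppeteer (the URL is a placeholder, and hooking this into a Scrapy downloader middleware is not shown here):

```python
# Sketch: fetch the fully rendered HTML of a page with Pyppeteer.
import asyncio

from pyppeteer import launch


async def fetch_rendered_html(url: str) -> str:
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        # Wait until network activity settles so client-side rendering finishes.
        await page.goto(url, waitUntil="networkidle0")
        return await page.content()
    finally:
        await browser.close()


if __name__ == "__main__":
    html = asyncio.get_event_loop().run_until_complete(
        fetch_rendered_html("https://example.com/")  # placeholder URL
    )
    print(len(html))
```

The `waitUntil="networkidle0"` option waits for network activity to go quiet, which is usually enough for a React app to finish its initial render before you read the page content.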

It is worth reading up on the pros and cons of Puppeteer versus Selenium to better calibrate the choice to your use case. One major limitation is that Puppeteer only supports Chrome for now.

ViridTomb