How to choose a good scraper based on content type?

Question

I want to choose a scraper based on a given URL. The problem is that BeautifulSoup is sometimes unable to scrape JS-protected pages, and in that situation I use Selenium instead of bs4. I want to know how to detect the content type of a website and, based on that type, choose the right scraper automatically. Please suggest an approach for this task.

I'm not asking for a library recommendation; I mentioned two libraries myself. The question is about an algorithm to separate websites into static and dynamic content types. – Alireza Mirhabibi - IRAN, Jul 28 '23 at 09:14
What I do is open the page in Chrome with the Network tab open and then search for the content I want. If it's in the main document I use something like BS, and if it's in a fetch request, possibly something like Selenium. – pguardiario, Jul 29 '23 at 00:26

score 1 · Answer 1 · answered Jul 27 '23 at 22:02

To choose an efficient scraping tool, you need to take a closer look at the HTML DOM as rendered by the browser.

Solution

If you have to scrape pages with static elements, BeautifulSoup provides the precision and performance you need.

But if the elements on the page are dynamically generated either through:

Then you need to allow the dynamic components to be rendered within the DOM tree. In those cases, there can't be any better approach than using Selenium.
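One way to turn the static-versus-dynamic distinction into an automatic check is to look for the element you care about in the raw HTML, before any JavaScript runs. The sketch below is only an illustration under stated assumptions: the helper name `is_in_static_html`, the CSS selector, and the two sample pages are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup


def is_in_static_html(html: str, css_selector: str) -> bool:
    """Return True if the selector matches an element with text in the raw HTML.

    Hypothetical helper: a missing or empty match suggests the content is
    filled in later by JavaScript, so a browser-based tool may be needed.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(css_selector)
    # An element that exists but has no text is treated as "not yet rendered".
    return element is not None and bool(element.get_text(strip=True))


# Made-up sample pages: one with the value in the markup, one with a
# placeholder that a script would presumably fill in.
static_page = '<html><body><div id="price">42 USD</div></body></html>'
js_page = (
    '<html><body><div id="price"></div>'
    '<script src="app.js"></script></body></html>'
)

print(is_in_static_html(static_page, "#price"))
print(is_in_static_html(js_page, "#price"))
```

The check is a heuristic: an element that is legitimately empty (an image, for instance) would be reported as missing, so the selector should point at something that is expected to contain text.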

undetected Selenium

Not always the case. If it's dynamically generated, I'd first check whether there's an API to get the data directly; I would then check if it's embedded in a ` – chitown88 Jul 28 '23 at 07:27
Thank you for the answer, but my question is how to detect whether content is statically or dynamically rendered, so that a suitable scraper can be chosen automatically. – Alireza Mirhabibi - IRAN Jul 28 '23 at 09:09
I want to find a way to choose between Selenium and BS4 based on content type. – Alireza Mirhabibi - IRAN Jul 28 '23 at 09:11
_`content type`_ is an abstract term, possibly technology-_agnostic_ – undetected Selenium Jul 28 '23 at 20:54
I need an element that I can first try to find with BeautifulSoup and, if it doesn't exist in the bs4 result, try to find with Selenium instead. I prefer to try BeautifulSoup first because of its execution speed, etc. Please suggest an HTML element (or something similar) to use as the anchor for choosing a specific scraper. – Alireza Mirhabibi - IRAN Aug 03 '23 at 08:14
I want to switch between BS4 and Selenium automatically based on content type (DOM tree). – Alireza Mirhabibi - IRAN Aug 03 '23 at 08:23
[BeautifulSoup](https://stackoverflow.com/a/47871704/7429447) and [Selenium](https://stackoverflow.com/a/54482491/7429447) use different techniques/algorithms. You can just club them together. – undetected Selenium Aug 03 '23 at 20:14
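
The "try BeautifulSoup first, fall back to Selenium" idea discussed in these comments could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function names (`fetch_static`, `fetch_rendered`, `extract_text`, `scrape`), the use of `requests` for the static fetch, headless Chrome, and the 10-second timeouts are all assumptions introduced here. The anchor is a CSS selector that you choose for the content you expect on the page.

```python
from typing import Callable, Optional, Tuple


def fetch_static(url: str) -> str:
    """Fetch the raw HTML without running any JavaScript (assumes `requests`)."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def fetch_rendered(url: str, css_selector: str) -> str:
    """Load the page in headless Chrome and return the DOM after rendering."""
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        try:
            # Give scripts up to 10 seconds to insert the anchor element.
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
            )
        except TimeoutException:
            pass  # Return whatever was rendered; extraction will report None.
        return driver.page_source
    finally:
        driver.quit()


def extract_text(html: str, css_selector: str) -> Optional[str]:
    """Return the stripped text of the first match, or None if absent or empty."""
    from bs4 import BeautifulSoup

    element = BeautifulSoup(html, "html.parser").select_one(css_selector)
    text = element.get_text(strip=True) if element is not None else ""
    return text or None


def scrape(
    url: str,
    css_selector: str,
    static_fetch: Callable[[str], str] = fetch_static,
    rendered_fetch: Callable[[str, str], str] = fetch_rendered,
    extract: Callable[[str, str], Optional[str]] = extract_text,
) -> Tuple[str, Optional[str]]:
    """Try the cheap static path first; use the browser only if it finds nothing."""
    text = extract(static_fetch(url), css_selector)
    if text:
        return "static", text
    return "rendered", extract(rendered_fetch(url, css_selector), css_selector)
```

A call such as `scrape("https://example.com", "h1")` returns a pair of the path that was used and the extracted text. The fetchers are passed in as parameters so the decision logic can be exercised without a network or a browser. The result of the first attempt could also be cached per site, so that known dynamic sites go straight to Selenium.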