Pros and Cons of Python Web Scraping using BeautifulSoup vs XPath

Question

I've been learning about web scraping using BeautifulSoup in Python recently, but earlier today I was advised to consider using XPath expressions instead.

How do XPath and BeautifulSoup differ in the way they work?

I'm trying to figure out what would be the benefits of me using one instead of the other. Surely there's a difference between someone saying "BS is what I use because I think overall it is better" and what I'm actually asking for which is "beneficial features of using XPath would be ..." — DanielSon, Oct 02 '15 at 17:30
both recommend different ways to go and the latter answer also contradicts itself by saying *Having said that, I find it often easier to write a bs4 snippet than the corresponding lxml.* so even they cannot actually make up their mind totally which is *better*. What would you consider would make one better than the other, speed, ease of use, code readability or what exactly is your criteria for defining the best? — Padraic Cunningham, Oct 02 '15 at 17:37
In my question I didn't ask which one was better. If you read it through I literally only asked to learn the differences between them and the pros and cons of each option. My question encourages contradiction, I wasn't looking for a unified set of advantages of one or the other. More to the point I think the answers to this question would not only be beneficial to myself (as they already have been) but also to others in my situation in the future. — DanielSon, Oct 02 '15 at 23:37

score 5 · Answer 1 · edited May 23 '17 at 12:31

I have used both BeautifulSoup and lxml, and based on experience I lean towards lxml. See performance comparison here. One thing to watch when using BeautifulSoup is to select a parser explicitly. If you leave the choice to the default, the parser picked for you may parse the page incorrectly without any warning, which can lead to nightmares (my experience here).
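As a minimal sketch of what naming the parser looks like (the `PAGE` string and its links are made up for illustration, and `"html.parser"` is just one possible choice):

```python
from bs4 import BeautifulSoup

# Hypothetical page content, used only for illustration.
PAGE = "<html><body><a href='/one'>One</a><a href='/two'>Two</a></body></html>"

# Name the parser explicitly; "html.parser" ships with Python itself.
# Alternatives such as "lxml" or "html5lib" must be installed separately
# and can build a different tree from the same invalid markup.
soup = BeautifulSoup(PAGE, "html.parser")

# Collect the href attribute of every <a> tag.
links = [a["href"] for a in soup.find_all("a")]
print(links)
```

Passing the parser name as the second argument means the result no longer depends on which parser libraries happen to be installed on a given machine.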

Having said that, I find it often easier to write a bs4 snippet than the corresponding lxml.
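To make the comparison concrete, here is a rough side-by-side sketch of the same extraction in both styles. The `PAGE` markup and the `post` class are invented for the example, and this is only one of several ways to write either version:

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html

# Hypothetical markup, used only for illustration.
PAGE = """
<html><body>
  <div class="post"><a href="/a">First</a></div>
  <div class="post"><a href="/b">Second</a></div>
  <div class="ad"><a href="/x">Sponsored</a></div>
</body></html>
"""

# BeautifulSoup: walk the tree with Python method calls.
soup = BeautifulSoup(PAGE, "html.parser")
bs4_links = [div.a["href"] for div in soup.find_all("div", class_="post")]

# lxml: describe the target nodes in a single XPath expression.
tree = lxml_html.fromstring(PAGE)
xpath_links = tree.xpath('//div[@class="post"]/a/@href')

print(bs4_links)
print(xpath_links)
```

The bs4 version reads like ordinary Python, while the XPath version packs the whole query into one string that can also select attributes directly.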

score 4 · Answer 2 · answered Oct 02 '15 at 16:59

I would suggest bs4: its usage and docs are friendlier, which will save you time and build confidence, and that is very important when you are teaching yourself string manipulation.

However, in practice it requires a strong CPU. I once scraped with no more than 30 connections on my 1-core VPS, and the CPU usage of the Python process stayed at 100%. It could have been the result of a bad implementation, but later I changed everything to re.compile and the performance issue was gone.

As for performance, regex > lxml >> bs4. As for getting things done, there is no difference.
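For reference, a precompiled-regex approach along the lines described above might look like the sketch below. The pattern and the sample `page` string are assumptions made for illustration, not the code from the original scraper, and a pattern like this only works on markup whose shape you already know:

```python
import re

# Compile once and reuse across pages. The pattern assumes double-quoted
# href attributes inside <a> tags and is not a general HTML parser.
HREF_RE = re.compile(r'<a\s+[^>]*href="([^"]*)"', re.IGNORECASE)

# Hypothetical page fragment, used only for illustration.
page = '<div><a href="/a">First</a> <a class="x" href="/b">Second</a></div>'

# findall returns the captured group (the href value) for each match.
links = HREF_RE.findall(page)
print(links)
```

The speed comes from skipping tree construction entirely, which is also why this breaks as soon as the markup changes shape (single quotes, reordered attributes split by `>`, and so on).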
