I need to do a lot of HTML parsing, scraping, search-engine work, and crawling.

There are many libraries available at the moment, such as Scrapy, Beautiful Soup, lxml, requests, and PyQuery.

Now, I don't want to try each of these and then decide. Basically, I want to settle on one, study it in detail, and then use it most of the time.

So which library should I go for that can perform all the functions mentioned above? There may well be different solutions for different problems, but I want one library that can do everything, even if it takes longer to code, as long as it is possible.

Is it possible to do indexing in lxml? Is PyQuery the same as lxml, or is it different?

user782234

2 Answers

I'm using Beautiful Soup and am very happy with it. So far it has answered all my scraping needs. Two main benefits:

  • It's pretty good at handling non-perfect HTML. Since browsers are quite lax, many HTML documents aren't 100% well-formed.
  • In addition to the high-level access APIs, it has low-level APIs that make it extensible when some specific scraping need isn't directly provided for (a short sketch follows below).
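
For illustration, here is a minimal sketch of both points, assuming Beautiful Soup 4 (bs4) is installed; the markup is invented and not part of the original answer:

    from bs4 import BeautifulSoup

    # Sloppy, not-well-formed HTML: unclosed <p> and <li> tags, no </html>.
    html = "<html><body><p>First item<p>Second item<ul><li>one<li>two</body>"

    # Parsing it raises no error.
    soup = BeautifulSoup(html, "html.parser")

    # High-level API: both paragraphs and both list items are still found.
    print([p.get_text() for p in soup.find_all("p")])
    print([li.get_text() for li in soup.find_all("li")])

    # Lower-level access: walk every node of the parse tree directly.
    for element in soup.descendants:
        print(type(element).__name__, repr(element))

The same tree objects returned by the high-level calls are what the low-level traversal walks, so the two styles can be mixed freely.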
Eli Bendersky

Since lots of HTML documents are not well-formed but rather just a bunch of tags (sometimes not even properly nested), you probably want to go with BeautifulSoup instead of one of the XML-based parsers.
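To make the contrast concrete, here is a small sketch, assuming bs4 is installed; the snippet is invented, not taken from the answer:

    from xml.etree import ElementTree
    from bs4 import BeautifulSoup

    # Improperly nested tags: </b> appears while <i> is still open.
    broken = "<div><b>bold <i>both</b> italic</i></div>"

    try:
        ElementTree.fromstring(broken)  # strict, XML-based parsing
    except ElementTree.ParseError as exc:
        print("XML parser rejected it:", exc)

    # Beautiful Soup still builds a usable tree from the same input.
    soup = BeautifulSoup(broken, "html.parser")
    print(soup.find("b").get_text())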

ThiefMaster