I need to do a lot of HTML parsing, scraping, search-engine work, and crawling.

There are many libraries available at the moment, such as Scrapy, Beautiful Soup, lxml, requests, and PyQuery.

Now, I don't want to try each of these and then decide. Basically, I want to settle on one, study it in detail, and then use it most of the time.

So which library should I go for that can perform all the functions mentioned above? There may well be different solutions for different problems, but I want one library that can do everything, even if it takes longer to code, as long as it is possible.

Is it possible to do indexing in lxml? Is PyQuery the same as lxml, or is it different?

user782234

2 Answers

I'm using Beautiful Soup and am very happy with it. So far it has answered all my scraping needs. Two main benefits:

  • It's pretty good at handling non-perfect HTML. Since browsers are quite lax, many HTML documents aren't 100% well-formed.
  • In addition to the high-level access APIs, it has low-level APIs that make it extensible when some specific scraping need isn't directly provided for (a short sketch follows below).
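
For illustration, here is a minimal sketch of both points, assuming Beautiful Soup 4 (bs4) is installed; the markup is invented and not part of the original answer:

    from bs4 import BeautifulSoup

    # Sloppy, not-well-formed HTML: unclosed <p> and <li> tags, no </html>.
    html = "<html><body><p>First item<p>Second item<ul><li>one<li>two</body>"

    # Parsing it raises no error.
    soup = BeautifulSoup(html, "html.parser")

    # High-level API: both paragraphs and both list items are still found.
    print([p.get_text() for p in soup.find_all("p")])
    print([li.get_text() for li in soup.find_all("li")])

    # Lower-level access: walk every node of the parse tree directly.
    for element in soup.descendants:
        print(type(element).__name__, repr(element))

The same tree objects returned by the high-level calls are what the low-level traversal walks, so the two styles can be mixed freely.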
Eli Bendersky

Since lots of HTML documents are not well-formed but rather just a bunch of tags (sometimes not even properly nested), you probably want to go with BeautifulSoup instead of one of the XML-based parsers.
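To make the contrast concrete, here is a small sketch, assuming bs4 is installed; the snippet is invented, not taken from the answer:

    from xml.etree import ElementTree
    from bs4 import BeautifulSoup

    # Improperly nested tags: </b> appears while <i> is still open.
    broken = "<div><b>bold <i>both</b> italic</i></div>"

    try:
        ElementTree.fromstring(broken)  # strict, XML-based parsing
    except ElementTree.ParseError as exc:
        print("XML parser rejected it:", exc)

    # Beautiful Soup still builds a usable tree from the same input.
    soup = BeautifulSoup(broken, "html.parser")
    print(soup.find("b").get_text())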

ThiefMaster