
I've started learning Python over the past couple of days. I want to know the equivalent way of writing crawlers in Python.

In Ruby I use:

  1. Nokogiri for crawling HTML and getting content through CSS selectors
  2. Net::HTTP with Net::HTTP::Get.new(uri.request_uri).body for getting JSON data from a URL

What are the equivalents of these in Python?

zsquare

4 Answers

  1. Between lxml and Beautiful Soup, lxml is the closer equivalent to Nokogiri because it is based on libxml2 and has XPath/CSS support.
  2. The equivalent of net/http is urllib2.
pguardiario

Well, mainly you have to separate the scraper/crawler (the Python library/program/function that downloads the files/data from the web server) from the parser that reads and interprets that data. In my case I had to scrape some government info that is 'open' but not download/data-friendly. For this project I used Scrapy [1].

Mainly I set the start_urls, which are the URLs my robot will crawl, and then use a parse function to retrieve and parse the data; a sketch is shown below.
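
Here is a rough sketch of such a spider. The spider name, URL, and selectors are invented for illustration, and `.get()` is spelled `.extract_first()` on older Scrapy versions:

    import scrapy

    class GovDataSpider(scrapy.Spider):
        name = 'govdata'  # made-up spider name
        start_urls = ['http://example.gov/reports']  # placeholder URL

        def parse(self, response):
            # Scrapy's selectors (lxml underneath) support both CSS and XPath
            for row in response.css('table.reports tr'):
                yield {
                    'title': row.css('td.title::text').get(),
                    'link': row.css('a::attr(href)').get(),
                }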

For parsing/retrieving you are going to need an HTML/lxml extractor, since roughly 90% of your data will be HTML.

Now, focusing on your question:

For data crawling:

  1. Scrapy
  2. Requests [2]
  3. urllib [3]

For parsing data:

  1. Scrapy/lxml, or Scrapy plus another parser
  2. lxml [4]
  3. Beautiful Soup [5] (see the sketch after this list)
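
A quick sketch pairing option 2 from the crawling list with option 3 from the parsing list; the URLs and tag names are placeholders, and the import assumes Beautiful Soup 4:

    import requests
    from bs4 import BeautifulSoup  # Beautiful Soup 4

    # Getting JSON from a URL, the closest analogue to the Net::HTTP example
    data = requests.get('http://example.com/api/items.json').json()

    # Downloading a page and parsing it
    html = requests.get('http://example.com/').text
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        print a.get('href'), a.get_text()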

And please remember, crawling and scraping are not only for the web; you can crawl emails too. You can check another question about that here [6].

[1] http://scrapy.org/

[2] http://docs.python-requests.org/en/latest/

[3] http://docs.python.org/library/urllib.html

[4] http://lxml.de/

[5] http://www.crummy.com/software/BeautifulSoup/

[6] Python read my outlook email mailbox and parse messages

Carlos Henrique Cano

The de facto real-world HTML parser in Python is Beautiful Soup. The Python Requests library is popular these days for HTTP (the standard library has similar functionality, but with a rather cumbersome API).

The Scrapy and HarvestMan projects are real-world crawlers that have been purpose-built for crawling.
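
To illustrate the "cumbersome API" point, here is the same GET request done both ways; a sketch with a placeholder URL and header:

    import json
    import urllib2

    import requests

    # Standard library (Python 2): build a Request object, then read and decode by hand
    req = urllib2.Request('http://example.com/api', headers={'Accept': 'application/json'})
    data = json.loads(urllib2.urlopen(req).read())

    # Requests: one call, with JSON decoding built in
    data = requests.get('http://example.com/api', headers={'Accept': 'application/json'}).json()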

Noufal Ibrahim

I also use Beautiful Soup; it's a very easy way to parse HTML. When I was crawling some web pages I also used the ElementTree XML API. Personally, I really like the ElementTree library (it's easy to parse XML with).
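
A tiny ElementTree sketch (the XML snippet is invented for illustration):

    import xml.etree.ElementTree as ET

    xml = '<feed><item id="1">first</item><item id="2">second</item></feed>'
    root = ET.fromstring(xml)
    for item in root.findall('item'):
        print item.get('id'), item.text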