
I've started learning Python over the past couple of days. I want to know the equivalent way of writing crawlers in Python.

In Ruby I use:

  1. Nokogiri for crawling HTML and getting content through CSS selectors
  2. Net::HTTP with Net::HTTP::Get.new(uri.request_uri).body for getting JSON data from a URL

What are the equivalents of these in Python?

zsquare

4 Answers

  1. Between lxml and Beautiful Soup, lxml is the closer equivalent to Nokogiri because it is based on libxml2 and has XPath/CSS support.
  2. The equivalent of net/http is urllib2.
pguardiario

Well, mainly you have to separate the scraper/crawler (the Python library/program/function that downloads the files/data from the web server) from the parser that reads and interprets that data. In my case I had to scrape some government info that is 'open' but not download/data-friendly. For this project I used Scrapy [1].

Mainly I set the start_urls, which are the URLs my robot will crawl, and then use a parse function to retrieve and parse the data; a sketch is shown below.
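
Here is a rough sketch of such a spider. The spider name, URL, and selectors are invented for illustration, and `.get()` is spelled `.extract_first()` on older Scrapy versions:

    import scrapy

    class GovDataSpider(scrapy.Spider):
        name = 'govdata'  # made-up spider name
        start_urls = ['http://example.gov/reports']  # placeholder URL

        def parse(self, response):
            # Scrapy's selectors (lxml underneath) support both CSS and XPath
            for row in response.css('table.reports tr'):
                yield {
                    'title': row.css('td.title::text').get(),
                    'link': row.css('a::attr(href)').get(),
                }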

For parsing/retrieving you are going to need an HTML/lxml extractor, since roughly 90% of your data will be HTML.

Now, focusing on your question:

For data crawling:

  1. Scrapy
  2. Requests [2]
  3. urllib [3]

For parsing data:

  1. Scrapy/lxml, or Scrapy plus another parser
  2. lxml [4]
  3. Beautiful Soup [5] (see the sketch after this list)
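
A quick sketch pairing option 2 from the crawling list with option 3 from the parsing list; the URLs and tag names are placeholders, and the import assumes Beautiful Soup 4:

    import requests
    from bs4 import BeautifulSoup  # Beautiful Soup 4

    # Getting JSON from a URL, the closest analogue to the Net::HTTP example
    data = requests.get('http://example.com/api/items.json').json()

    # Downloading a page and parsing it
    html = requests.get('http://example.com/').text
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        print a.get('href'), a.get_text()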

And please remember, crawling and scraping are not only for the web; you can crawl emails too. You can check another question about that here [6].

[1] http://scrapy.org/

[2] http://docs.python-requests.org/en/latest/

[3] http://docs.python.org/library/urllib.html

[4] http://lxml.de/

[5] http://www.crummy.com/software/BeautifulSoup/

[6] Python read my outlook email mailbox and parse messages

Carlos Henrique Cano

The de facto real-world HTML parser in Python is Beautiful Soup. The Python Requests library is popular these days for HTTP (the standard library has similar functionality, but with a rather cumbersome API).

The Scrapy and HarvestMan projects are real-world crawlers that have been purpose-built for crawling.
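
To illustrate the "cumbersome API" point, here is the same GET request done both ways; a sketch with a placeholder URL and header:

    import json
    import urllib2

    import requests

    # Standard library (Python 2): build a Request object, then read and decode by hand
    req = urllib2.Request('http://example.com/api', headers={'Accept': 'application/json'})
    data = json.loads(urllib2.urlopen(req).read())

    # Requests: one call, with JSON decoding built in
    data = requests.get('http://example.com/api', headers={'Accept': 'application/json'}).json()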

Noufal Ibrahim

I also use Beautiful Soup; it's a very easy way to parse HTML. When I was crawling some web pages I also used the ElementTree XML API. Personally, I really like the ElementTree library (it's easy to parse XML with).
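
A tiny ElementTree sketch (the XML snippet is invented for illustration):

    import xml.etree.ElementTree as ET

    xml = '<feed><item id="1">first</item><item id="2">second</item></feed>'
    root = ET.fromstring(xml)
    for item in root.findall('item'):
        print item.get('id'), item.text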