I am trying to parse text from web pages, starting at this page. This page has links to the final pages (these links could also be copied into a text file manually, to avoid extra coding effort). On each final page there is a page index on the left-hand side, and each page also has its own page index in the top section. From this list of items I only need to extract the lines starting with 'Configuring', 'Configuration Examples' or 'Example'.

This task seems simple when done manually, but it is daunting and hard to keep track of. Ideally the information could be extracted by a tool that crawls the pages and logs the items in hierarchical order as it finds them, in some simple format that also includes the hyperlink, or at least as a plain tab-separated text file.

The information on the web pages is public and downloadable. If it is hard to extract over the web, I could also download the pages and process them offline.

I did some research on this and looked at LinksGrabber, WebParser and BeautifulSoup; parsing the text with regex might also work with some tweaks. But I am still a few light-years away from implementing this idea.

Is what I am trying to do achievable with Python, and what would be a realistic way to approach it?

PS: I understand this is web scraping, but I am doing this only for personal education purposes; it has no commercial value or association.

Martijn Pieters
user1582596
  • Whatever you do, don't use [regular expressions to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Martijn Pieters Sep 08 '12 at 10:41
  • lxml is the way I'd do this. See my answer here: http://stackoverflow.com/questions/12073781/parsing-html-documents-using-lxml-in-python/12073964#12073964 – Steve Mayne Sep 08 '12 at 10:41
  • And take a look at [Scrapy](http://scrapy.org/), which automates web scraping for you. – Martijn Pieters Sep 08 '12 at 10:42
  • [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is also pretty popular, powerful and simple. And in pure python. – gorlum0 Sep 08 '12 at 10:45

1 Answer

You should try Scrapy. There you can set up a model that contains the data you want from the page, e.g.

from scrapy.item import Item, Field

class Torrent(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()

Then you can write a spider that scrapes this data. See "Scrapy at a glance" in the Scrapy documentation.
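Whichever crawler fetches the pages, the filtering step from the question (keep only index entries starting with 'Configuring', 'Configuration Examples' or 'Example') can be tried on its own first. The following is a minimal sketch that uses the standard library's `html.parser` instead of Scrapy, and it assumes the index entries are plain `<a>` links. The names `IndexLinkParser`, `extract_index_entries` and `to_tsv`, the sample HTML and the `example.com` URL are all made up for illustration; the real pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Prefixes named in the question.
PREFIXES = ("Configuring", "Configuration Examples", "Example")


class IndexLinkParser(HTMLParser):
    """Collect (text, absolute_url) pairs for links whose text starts with a wanted prefix."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.matches = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            if text.startswith(PREFIXES):
                self.matches.append((text, urljoin(self.base_url, self._href)))
            self._href = None


def extract_index_entries(html, base_url):
    """Return matching (text, url) pairs in document order."""
    parser = IndexLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.matches


def to_tsv(entries):
    """Format entries as tab-separated lines: text, then URL."""
    return "\n".join(f"{text}\t{url}" for text, url in entries)


if __name__ == "__main__":
    # Hypothetical page index, only for demonstration.
    sample = """
    <ul>
      <li><a href="#intro">Overview</a></li>
      <li><a href="#cfg">Configuring Widgets</a></li>
      <li><a href="ex.html#e1">Example: Basic Setup</a></li>
    </ul>
    """
    print(to_tsv(extract_index_entries(sample, "http://example.com/guide/page.html")))
```

The same prefix test could later go into a Scrapy spider's callback, with each match stored in an item like the one above.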

webjunkie
  • Thank you for the input. Here is the URL I am trying this on: _http://tinyurl.com/mp238t. I wonder if you could give a closer example. I will definitely try it myself, but my pace will be very slow as this is totally new to me. – user1582596 Sep 08 '12 at 11:47
  • @user1582596 it is all new to everyone at some point. Think of it as an opportunity to learn rather than a place to use someone else's code that you don't understand. – msw Sep 08 '12 at 12:28