
I am looking to use Python to scrape some data from my university's intranet and download all the research papers. I have looked at Python scraping before, but haven't really done any myself. I'm sure I read about a Python scraping framework somewhere; should I use that?

So in essence this is what I need to scrape:

  1. Authors
  2. Description
  3. Field
  4. Then download the file and rename it with the paper name.

I will then put all this in either XML or a database (most probably XML), and develop an interface etc. at a later date.

Is this doable? Any ideas on where I should start?

Thanks in advance, LukeJenx

EDIT: The framework is Scrapy

EDIT: Turns out that I nearly killed the server today, so a lecturer is getting me copies from the Network team instead... Thanks!

  • Sure, it's doable. I'd use something like [urllib2](http://docs.python.org/library/urllib2.html) to download the page, then Python's [regular expression module **re**](http://docs.python.org/library/re.html) to extract the fields you need. (You could also use some XML parser, but I've found it easier to use regex when the HTML isn't proper XML.) For a database, I'd probably just [pickle](http://docs.python.org/library/pickle.html) the data structures and save them to a file unless I needed to provide "external" access. – jedwards Oct 23 '12 at 19:47
  • Do pages use javascript to generate data you are interested in? – jfs Oct 23 '12 at 19:54
  • The pages are all in PHP, I believe, and all data is retrieved from the database. I don't have access to the site right now (I'm at home), so I can't confirm this, however. – LukeJenx Oct 23 '12 at 20:05
  • If there's no JS-generated content, as @J.F.Sebastian has mentioned, then [lxml.html](http://lxml.de/lxmlhtml.html) is part of `lxml`; and if cookies etc. are required, combine that with [requests](http://docs.python-requests.org/en/latest/index.html) (see the sketch after these comments). – Jon Clements Oct 23 '12 at 20:10
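
A minimal sketch of the `requests` + `lxml.html` combination suggested in the comments; the URL, the class name, and the XPath are placeholders, since the real intranet markup isn't shown here:

```python
import requests
import lxml.html

# A Session keeps cookies between requests, useful if the intranet needs a login
session = requests.Session()
page = session.get("http://intranet.example.edu/research/papers")  # hypothetical URL

# Parse the HTML and pull out paper titles with an XPath (adjust to the real markup)
tree = lxml.html.fromstring(page.text)
titles = tree.xpath('//div[@class="paper"]//h2/a/text()')
print(titles)
```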

2 Answers


Scrapy is a great framework, and has really good documentation as well. You should start there.

If you don't know XPaths, I'd recommend learning them if you plan to use Scrapy (they're extremely easy!). XPaths help you "locate" the specific elements inside the HTML that you want to extract.
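
For example, here's a minimal sketch of what a spider for this task might look like, assuming a reasonably recent Scrapy version; the start URL and every XPath below are made-up placeholders that would need adjusting to the intranet's actual markup:

```python
import scrapy

class PaperSpider(scrapy.Spider):
    name = "papers"
    # Hypothetical listing page on the intranet
    start_urls = ["http://intranet.example.edu/research/papers"]

    def parse(self, response):
        # Each XPath below "locates" one piece of data inside the HTML
        for paper in response.xpath('//div[@class="paper"]'):
            yield {
                "title": paper.xpath('.//h2/a/text()').extract_first(),
                "authors": paper.xpath('.//span[@class="authors"]/text()').extract_first(),
                "description": paper.xpath('.//p[@class="abstract"]/text()').extract_first(),
                "field": paper.xpath('.//span[@class="field"]/text()').extract_first(),
                "file_url": response.urljoin(paper.xpath('.//a/@href').extract_first()),
            }
```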

Scrapy already has a built-in command-line option to export to XML, CSV, etc., e.g. `scrapy crawl <spidername> -o <filename> -t xml`.

Mechanize is another great option for writing scrapers easily.
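
If the intranet requires a login first, a Mechanize sketch might look like the following; the URLs and form field names are assumptions, since the real login page isn't described in the question:

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # intranet pages often disallow robots; use responsibly

# Hypothetical login page and form fields
br.open("http://intranet.example.edu/login")
br.select_form(nr=0)          # select the first form on the page
br["username"] = "your_user"
br["password"] = "your_pass"
br.submit()

# After logging in, pages can be fetched through the same "browser"
html = br.open("http://intranet.example.edu/research/papers").read()
```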

Anuj Gupta

Yes, this is very doable, although it depends a lot on the pages. As implied in the comments, a JS-heavy site could make this very difficult.

That aside, for downloading, use the standard urllib2, or look at Requests for a lighter, less painful experience.
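
For instance, downloading a paper and saving it under its title with Requests could look roughly like this; the URL and title are placeholders that would come from the scraped listing page:

```python
import requests

# Both values would come from the scraped listing page; these are placeholders
pdf_url = "http://intranet.example.edu/files/1234.pdf"
title = "Some Paper Title"

resp = requests.get(pdf_url)
resp.raise_for_status()  # fail loudly on HTTP errors

# Save the file, renamed after the paper's title
with open(title + ".pdf", "wb") as f:
    f.write(resp.content)
```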

However, it's best not to use regexes to parse HTML; that way lies a world of endless screaming. Seriously though, try BeautifulSoup instead - it's powerful and quite high-level.
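
A rough BeautifulSoup sketch; the tag and class names are guesses, since the actual page markup isn't shown:

```python
from bs4 import BeautifulSoup

# `html` would normally be fetched with Requests; a tiny sample keeps this runnable
html = '<div class="paper"><h2>Sample Paper</h2><span class="authors">A. Author</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Tag and class names are placeholders; inspect the real markup first
for paper in soup.find_all("div", class_="paper"):
    title = paper.find("h2").get_text(strip=True)
    authors = paper.find("span", class_="authors").get_text(strip=True)
    print(title, "-", authors)
```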

For storage, use whichever is easiest (to me XML seems like overkill; perhaps consider the json library instead).
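
For example, dumping the scraped records to a JSON file is only a couple of lines; the field names here simply mirror the ones listed in the question:

```python
import json

# One dict per paper, mirroring the fields in the question
papers = [
    {"title": "Sample Paper", "authors": "A. Author",
     "field": "Computer Science", "description": "..."},
]

with open("papers.json", "w") as f:
    json.dump(papers, f, indent=2)
```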

declension