
Because I hate clicking back and forth while reading Wikipedia articles, I am trying to build a tool that creates "expanded Wikipedia articles" according to the following algorithm:

  • Create two variables: Depth and Length.
  • Set a Wikipedia article as the seed page.
  • Parse through this article: whenever there is a link to another article, fetch the first Length sentences of it and include them in the original article (e.g. in brackets or otherwise highlighted).
  • Do this recursively up to a certain Depth, i.e. no deeper than Depth levels (say, two).

The result would be an article that could be read in one go, without constantly clicking back and forth...

How would you build such a mechanism in Python? Which libraries should be used (are there any for such tasks)? Are there any helpful tutorials?

vonjd

5 Answers


You can use urllib2 to request the URL. For parsing the HTML page there is a wonderful library called BeautifulSoup. One thing to consider is that when scanning Wikipedia with your crawler you need to send a header along with your request, or else Wikipedia will simply disallow the crawl.

 import urllib2
 request = urllib2.Request(page)  # page holds the article URL

Adding a header:

 request.add_header('User-agent', 'Mozilla/5.0 (Linux i686)')

Then load the page and pass the response to BeautifulSoup:

 from bs4 import BeautifulSoup
 response = urllib2.urlopen(request)            # actually fetch the page
 soup = BeautifulSoup(response, "html.parser")
 text = soup.get_text()

This will give you the links on a page:

 import re
 # Note: internal Wikipedia article links are relative and start with "/wiki/"
 for url in soup.find_all('a', attrs={'href': re.compile("^http://")}):
     link = url['href']

And now, regarding the algorithm for crawling Wikipedia: what you want is called Depth-Limited Search. Pseudocode is provided on the Wikipedia page for that algorithm and is easy to follow.
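Here is a minimal sketch of that idea in Python, building on the urllib2/BeautifulSoup snippets above; the helper name fetch_soup, the User-agent string, and the /wiki/ filter are my own illustrative choices, not a fixed API:

 import re
 import urllib2
 from bs4 import BeautifulSoup

 def fetch_soup(url):
     # Request a page with a User-agent header and parse it
     request = urllib2.Request(url)
     request.add_header('User-agent', 'Mozilla/5.0 (Linux i686)')
     return BeautifulSoup(urllib2.urlopen(request), "html.parser")

 def expand(url, depth, length):
     # Depth-limited search: stop following links once depth reaches 0
     soup = fetch_soup(url)
     text = soup.get_text()
     if depth == 0:
         return text
     pieces = [text]
     # Internal article links are relative and start with /wiki/
     for a in soup.find_all('a', attrs={'href': re.compile(r"^/wiki/")}):
         child = expand("http://en.wikipedia.org" + a['href'], depth - 1, length)
         # Naive sentence split; keep only the first `length` sentences
         pieces.append('[' + '. '.join(child.split('. ')[:length]) + ']')
     return '\n'.join(pieces)

Calling expand(seed_url, 2, 3) roughly matches the behaviour described in the question; a real crawler would also want to deduplicate URLs and throttle its requests.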

Other functionality of these libraries is well documented and easy to find. Good luck.

Emil

You may want to try Mechanize for this - it's a bit higher-level than urllib and other built-in libraries. In particular, it's easy to navigate around just like you're using a browser, with commands like follow_link() and back().

To get the lines you want, have a look at the source of a few Wikipedia pages to see where the summary starts in the HTML - from a quick browse, I think you want to find the div with id "mw-content-text" and get the text from its first <p> element. As others have mentioned, Beautiful Soup would be good at this.
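A rough sketch of that extraction with requests and Beautiful Soup (the User-agent string is arbitrary, and Wikipedia's markup can vary, so treat the selector as a starting point):

 import requests
 from bs4 import BeautifulSoup

 def first_paragraph(url):
     html = requests.get(url, headers={'User-agent': 'wiki-expander'}).text
     soup = BeautifulSoup(html, "html.parser")
     # The article body sits in the div with id "mw-content-text";
     # its first non-empty <p> is usually the summary paragraph
     content = soup.find('div', id='mw-content-text')
     for p in content.find_all('p'):
         if p.get_text().strip():
             return p.get_text()

 print(first_paragraph("http://en.wikipedia.org/wiki/Web_crawler"))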

Alternatively, you could try one of the Python libraries that work with Wikipedia - there's a list here: http://en.wikipedia.org/wiki/Wikipedia%3aCreating_a_bot#Python, and some recommendations in other stackoverflow answers.

Sounds like a fun little project, good luck!

Neil Vass

Use BeautifulSoup or Scrapy to parse the HTML pages. Use urllib or requests to fetch the nested pages. You may need a few regular expressions to massage or filter the extracted links.
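For example, a minimal link-extraction pass with requests, BeautifulSoup, and a regular expression (the User-agent string and the namespace-filtering pattern are just illustrative choices):

 import re
 import requests
 from bs4 import BeautifulSoup

 html = requests.get("http://en.wikipedia.org/wiki/Web_crawler",
                     headers={'User-agent': 'wiki-expander'}).text
 soup = BeautifulSoup(html, "html.parser")

 # Keep plain article links; drop namespace pages such as File:, Help:, Category:
 article_link = re.compile(r"^/wiki/[^:]+$")
 links = ["http://en.wikipedia.org" + a['href']
          for a in soup.find_all('a', href=article_link)]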

Hans Then

You could parse the HTML, or you could parse the raw wiki markup looking for [[Link]]. Either way, you should take a look at:

urllib or requests
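If you go the raw-markup route, something like this could work (the action=raw query and the filtering of namespaced links are assumptions on my part):

 import re
 import requests

 # action=raw returns the article's wikitext instead of rendered HTML
 raw = requests.get("http://en.wikipedia.org/w/index.php",
                    params={'title': 'Web crawler', 'action': 'raw'},
                    headers={'User-agent': 'wiki-expander'}).text

 # [[Target]] or [[Target|label]] -- keep the target, skip File:/Category: links
 links = [t for t in re.findall(r"\[\[([^\]|#]+)", raw) if ':' not in t]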

oz123

Use the wikipedia Python library, which lets you see the links on a page (including the links in the "See also" section). You can iterate through them and use the library to get their content. https://pypi.python.org/pypi/wikipedia
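A small sketch with that library (the article title and the sentences=2 setting are only examples):

 import wikipedia

 page = wikipedia.page("Web crawler")
 for title in page.links[:10]:
     try:
         # First two sentences of each linked article
         print(title + ' -- ' + wikipedia.summary(title, sentences=2))
     except wikipedia.exceptions.DisambiguationError:
         pass  # skip titles that resolve to disambiguation pages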

roopalgarg