Wikipedia philosophy game diagram in python and R

Question

So I am relatively new to python, and in order to learn, I have started writing a program that goes online to wikipedia, finds the first link in the overview section of a random article, follows that link and keeps going until it either enters a loop or finds the philosophy page (as detailed here) and then repeats this process for a new random article a specified number of times. I then want to collect the results in some form of useful data structure, so that I can pass the data to R using the Rpy library so that I can draw some sort of network diagram (R is pretty good at drawing things like that) with each node in the diagram representing the pages visited, and the arrows that paths taken from the starting article to the philosophy page.

So I have no problem getting python to return the fairly structured html from wiki but there are some problems that I can't quite figure out. Up till now I have selected the first link using a cssselector from the lxml library. It selects for the first link ( in an a tag) that is a direct descendant of a p tag, that is a direct descendant of a div tag with class="mw-content-ltr" like this:

    user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)'
    values = {'name' : 'David Kavanagh',
      'location' : 'Belfast',
      'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }
    encodes = urllib.urlencode(values)
    req = urllib2.Request(url, encodes, headers)
    page = urllib2.urlopen(req)
    root = parse(page).getroot()
    return root.cssselect("div.mw-content-ltr>p>a")[0].get('href')

This code resides in a function which I use to find the first link in the page. It works for the most part but the problem is if the first link is inside some other tag as opposed to being a direct descendant of a p tag like let's say a b tag or something then I miss it. As you can see from the wiki article above, links in italics or inside parentheses aren't eligible for the game, which means that I never get a link in italics (good) but frequently do get links that are inside parentheses (bad) and sometimes miss the first link on a page like the first link on the Chair article, which is stool, but it is in bold, so I don't get it. I have tried removing the direct descendant stipulation but then I frequently get links that are "above" the overview section, that are usually in the side box, in a p tag, in a table, in the same div as the overview section.

So the first part of my question is:

How could I use cssselectors or some other function or library to select the first link in the overview section that is not inside parentheses or in italics. I thought about using regular expressions to look through the raw html but that seems like a very clunky solution and I thought that there might be something a bit nicer out there that I haven't thought of.

So currently I am storing the results in a list of lists. So I have a list called paths, in which there are lists that contain strings that contain the title of the wiki article.

The second part of the question is: How can I traverse this list of lists to represent the multiple convergent paths? Is storing the results like this a good idea? Since the end diagram should look something like an upside down tree, I thought about making some kind of tree class, but that seems like a lot of work for something that is conceptually, fairly simple.

Any ideas or suggestions would be greatly appreciated.
Cheers,
Davy

Beautiful Soup can parse HTML better (IMO). A BS object has attributes (in object-oriented sense) that would return nested tags as well as attributes (in HTML tag sense) of the tag. Should be a cinch. Dont **ever** use regular expressions to parse HTML http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Jesvin Jose, Nov 08 '11 at 13:50

taleinat · Accepted Answer · 2011-11-08T18:17:20.337

I'll just answer the second question:

For a start, just keep a dict mapping one Wikipedia article title to the next. This will make it easy and fast to check I've you've hit an article you've already found before. Essentially this is just storing a directed graph's vertices, indexed by their origins.

If you get to the point where a Python dict is not efficient enough (it does have a significant memory overhead, once you have millions of items memory can be an issue) you can find a more efficient graph data structure to suit your needs.

EDIT

Ok, I'll answer the first question as well...

For the first part, I highly recommend using the MediaWiki API instead of getting the HTML version and parsing it. The API enables querying for certain types of links, for instance just inter-wiki links or just inter-language links. Additionally, there are Python client libraries for this API, which should make using it from Python code simple.

Don't parse a website's HTML if it provides a comprehensive and well-documented API!

The whole english wikipedia is less than 4M articles. You can prototype inserting 4M random 50-byte strings into a hash just to check if the interpreter will handle it easily. — Tadeusz A. Kadłubowski, Nov 08 '11 at 18:23

score 1 · Answer 2 · answered Nov 08 '11 at 12:18

For the first part, it's not possible to find brackets with css selectors, because as far as the html is concerned brackets are just text.

If I were you, I'd use selectors to find all the relevant paragraph elements that are valid for the game. Then, I'd look in the text of the paragraph element, and remove any that is not valid - for instance, anything between brackets, and anything between italic tags. I'd then search this processed text for the link elements I need. This is slightly nicer than manually processing the whole html document.

I'm not sure I follow exactly what you are doing for the second part, but as for representing the results of this search as a tree: This is a bad idea as you are looking for cycles, which trees can't represent.

For a data structure, I would have lists of 'nodes', where a node represents a page and has a URL and a count of occurances. I'd then use a brute force algorithm to compare lists of nodes - if the two lists have a node that are the same, you could merge them, increasing the 'occurances' count for each mirrored node.

I would not use the standard python 'list' as this can't loop back on itself. Maybe create your own linked list implementation to contain the nodes.

Wikipedia philosophy game diagram in python and R

2 Answers2