I am relatively new to Python and, in order to learn, I have started writing a program that goes to Wikipedia, finds the first link in the overview section of a random article, follows that link, and keeps going until it either enters a loop or reaches the Philosophy page (as detailed here). It then repeats this process for a new random article a specified number of times. I want to collect the results in some useful data structure so that I can pass the data to R using the Rpy library and draw a network diagram (R is pretty good at drawing things like that), with each node representing a page visited and each arrow representing a step on the path from a starting article to the Philosophy page.
I have no problem getting Python to return the fairly structured HTML from Wikipedia, but there are some problems that I can't quite figure out. Up till now I have selected the first link using a CSS selector from the lxml library. It selects the first link (an a tag) that is a direct child of a p tag that is itself a direct child of a div tag with class="mw-content-ltr", like this:
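    import urllib
    import urllib2
    from lxml.html import parse

    def first_link(url):  # function name is illustrative; the original def line was not shown
        user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)'
        values = {'name': 'David Kavanagh',
                  'location': 'Belfast',
                  'language': 'Python'}
        headers = {'User-Agent': user_agent}
        encodes = urllib.urlencode(values)
        req = urllib2.Request(url, encodes, headers)
        page = urllib2.urlopen(req)
        root = parse(page).getroot()
        # First <a> that is a direct child of a <p> directly under div.mw-content-ltr
        return root.cssselect("div.mw-content-ltr>p>a")[0].get('href')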
This code resides in a function which I use to find the first link in the page. It works for the most part, but if the first link is nested inside some other tag (say, a b tag) rather than being a direct child of a p tag, I miss it. As you can see from the wiki article above, links in italics or inside parentheses aren't eligible for the game. With the current selector I never get a link in italics (good), but I frequently get links that are inside parentheses (bad), and I sometimes miss the real first link on a page. For example, the first link in the Chair article is stool, but it is in bold, so I don't get it. I have tried removing the direct-child stipulation, but then I frequently get links that come "above" the overview section: they are usually in the side box, in a p tag, in a table, in the same div as the overview section.
So the first part of my question is:
How could I use CSS selectors, or some other function or library, to select the first link in the overview section that is not inside parentheses or in italics? I thought about using regular expressions to look through the raw HTML, but that seems like a very clunky solution, and I thought there might be something a bit nicer out there that I haven't thought of.
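To make the first part concrete, here is a minimal sketch of one possible approach, not a definitive answer. It swaps the lxml CSS selector for the standard library's event-based HTMLParser so that it is self-contained, and it tracks three things while walking the markup: whether we are inside a table (to skip side boxes), whether we are inside i or em tags, and how many unmatched opening parentheses have been seen in the current paragraph's text. The names FirstLinkFinder and first_eligible_link are made up for illustration, and the sample HTML is hand-written to mimic the cases described above, not real Wikipedia markup.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class FirstLinkFinder(HTMLParser):
    """Record the first <a href> inside a <p> that is outside tables, italics and parentheses."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.table_depth = 0   # > 0 while inside a <table> (e.g. a side box)
        self.in_p = False      # True while inside a <p>
        self.italic_depth = 0  # > 0 while inside <i> or <em>
        self.paren_depth = 0   # unmatched "(" seen in the current paragraph
        self.href = None       # result; stays None if nothing qualifies

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.table_depth += 1
        elif tag == 'p':
            self.in_p = True
            self.paren_depth = 0
        elif tag in ('i', 'em'):
            self.italic_depth += 1
        elif tag == 'a' and self.href is None and self._eligible():
            # Anchors without an href leave self.href as None, so the search continues.
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'table':
            self.table_depth = max(0, self.table_depth - 1)
        elif tag == 'p':
            self.in_p = False
        elif tag in ('i', 'em'):
            self.italic_depth = max(0, self.italic_depth - 1)

    def handle_data(self, data):
        # Only paragraph text counts towards the parenthesis balance.
        if self.in_p:
            for ch in data:
                if ch == '(':
                    self.paren_depth += 1
                elif ch == ')' and self.paren_depth:
                    self.paren_depth -= 1

    def _eligible(self):
        return (self.in_p and self.table_depth == 0
                and self.italic_depth == 0 and self.paren_depth == 0)


def first_eligible_link(html):
    """Return the href of the first eligible link in an HTML string, or None."""
    finder = FirstLinkFinder()
    finder.feed(html)
    return finder.href


# Hand-written sample: a side-box link, a parenthesised link, an italic link,
# and finally a bold link that should be the one selected.
sample = (
    '<div><table><tr><td><p><a href="/wiki/Sidebox">box</a></p></td></tr></table>'
    '<p>A <b>chair</b> (from <a href="/wiki/Greek">Greek</a>) is, per '
    '<i><a href="/wiki/Italic">someone</a></i>, a kind of '
    '<b><a href="/wiki/Stool">stool</a></b>.</p></div>'
)
print(first_eligible_link(sample))  # prints /wiki/Stool
```

Because the check is made at the moment each a tag opens, nesting inside b or other inline tags does not matter, which is the case the direct-child selector misses. The sketch does not restrict itself to a particular div; feeding it only the content div's HTML would be one way to do that.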
Currently I am storing the results in a list of lists: I have a list called paths, and each of its elements is a list of strings holding the titles of the wiki articles visited on one run.
The second part of the question is: how can I traverse this list of lists to represent the multiple convergent paths? Is storing the results like this a good idea? Since the final diagram should look something like an upside-down tree, I thought about making some kind of tree class, but that seems like a lot of work for something that is conceptually fairly simple.
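For the second part, one option that avoids a tree class is sketched below, under the assumption that every article has exactly one "first link". In that case the whole result collapses into a plain dict mapping each page to the page it leads to, and paths that converge simply share entries. The function names paths_to_edges and follow are made up for illustration, and the titles in the sample paths list are invented data, not real results.

```python
def paths_to_edges(paths):
    """Collapse a list of paths into a {page: next_page} mapping."""
    edges = {}
    for path in paths:
        # Pair each title with the one that follows it in the same path.
        for page, next_page in zip(path, path[1:]):
            edges[page] = next_page
    return edges


def follow(edges, start):
    """Rebuild one path from the mapping, stopping at a dead end or a repeat."""
    path = [start]
    seen = set([start])
    while path[-1] in edges:
        next_page = edges[path[-1]]
        path.append(next_page)
        if next_page in seen:  # loop detected: stop after recording the repeat
            break
        seen.add(next_page)
    return path


# Invented sample data: two runs that converge on the same page.
paths = [
    ['Chair', 'Stool', 'Furniture', 'Philosophy'],
    ['Bench', 'Furniture', 'Philosophy'],
]

edges = paths_to_edges(paths)
edge_list = sorted(edges.items())  # (from, to) pairs, one per arrow in the diagram
print(len(edge_list))              # prints 4
print(follow(edges, 'Bench'))      # prints ['Bench', 'Furniture', 'Philosophy']
```

The shared step from Furniture to Philosophy is stored once, so the two paths merge without any extra work. A two-column list of (from, to) pairs like edge_list is a common input shape for graph plotting, though how it is handed to R through Rpy is left open here.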
Any ideas or suggestions would be greatly appreciated.
Cheers,
Davy