
I'm trying to do a little bit of HTML parsing in Python, which I'm honestly not good at. I've been up googling ways to do this but can't get anything to work. Here is my situation: I have a web page with a bunch of links to downloads. I want to specify a search string and, if a file name containing that string is there, download that file. It needs to pick up the entire file name: for example, if I search for game-1 and the actual game is named game-1-something-else, I want it to download game-1-something-else. I have already used the following code to obtain the source of the page:


    import urllib2

    # "file" shadows the built-in of the same name, so a different variable is safer
    response = urllib2.urlopen('http://www.example.com/my/example/dir')
    dload = response.read()
This grabs the entire source code of the web page, which is just a directory listing by itself. It is full of tags: `<a href>` tags, `<td>` tags, etc. I want to strip the tags so all I have is a list of the files in the directory, then use a regular expression or something similar to find the name I am searching for, take the entire file name, and download it.
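For reference, the `href` values can be pulled out of that source with nothing but the standard library. A minimal sketch in modern Python 3 (`html.parser` is the successor of Python 2's `HTMLParser` module; the sample HTML below is invented for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

# Invented sample of a directory-listing page, for illustration only.
sample = ('<td><a href="game-1-something-else.zip">game-1-...</a></td>'
          '<td><a href="game-2.zip">game-2</a></td>')

parser = LinkCollector()
parser.feed(sample)
print(parser.links)  # → ['game-1-something-else.zip', 'game-2.zip']
```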
Jmariz
  • `lxml.html` is your friend. Likewise XPath. – Charles Duffy Apr 16 '11 at 03:32
  • You cannot use regular expressions to parse HTML. Really. Never. Beau--ootiful Soo-oop! Beau--ootiful Soo-oop! Soo--oop of the e--e--evening, Beautiful, beautiful Soup! – msw Apr 16 '11 at 05:38

1 Answer


Once you have the HTML data, parse it and then you can make selections of nodes within the page:

    import lxml.html

    tree = lxml.html.fromstring(dload)
    for node in tree.xpath('//a'):   # every <a> element in the page
        print node.get('href')      # attributes are read with .get(), not []
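From a list of extracted hrefs, the matching-and-download step the question asks about could be sketched like this in Python 3 (the href values here are invented for illustration, and the actual download line is left commented out so the sketch runs offline):

```python
from urllib.parse import urljoin
# from urllib.request import urlretrieve  # uncomment to actually download

base_url = 'http://www.example.com/my/example/dir/'
search = 'game-1'

# hrefs as extracted from the page (values invented for illustration)
hrefs = ['game-1-something-else.zip', 'game-2.zip', '../']

# keep only the file names that start with the search string
matches = [h for h in hrefs if h.startswith(search)]
for name in matches:
    url = urljoin(base_url, name)   # resolve relative href against the page URL
    print(url)  # → http://www.example.com/my/example/dir/game-1-something-else.zip
    # urlretrieve(url, name)        # save under the file's own name
```

Using `str.startswith` keeps the whole file name, so searching for `game-1` downloads `game-1-something-else.zip`, exactly as asked.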
samplebias
  • Of course, you'll need to have [lxml](http://lxml.de/) installed, since it doesn't ship with Python. ... Who do we bribe to get Tkinter dropped and lxml added? – Mike DeSimone Apr 16 '11 at 04:35