Python find file download link on webpage

Question

I need a regex that will return to me the text contained between double quotes that starts with a specified text block, and ends with a specific file extension (say .txt). I'm using urllib2 to get the html of the page (the html is quite simple).

Basically if I have something like

<tr>
  <td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>
  <td><a href="Client-8.txt">new_Client-8.txt</a></td>
  <td align="right">27-Jun-2012 18:02  </td>
</tr>

It should just return to me

Client-8.txt

Where the returned value is contained within double quotes. I know that the file name starts with "Client-" and that the file extension is ".txt".

I'm playing around with re.search(regex, string) where the string I input is the html of the page. But I stink at regular expressions.

Thanks!

Time to link my favorite answer on SO again: http://stackoverflow.com/a/1732454/10077 — Fred Larson, Jun 29 '12 at 20:58
Well, that put an end to that. Now for something completely different! Thanks! — ZacAttack, Jun 29 '12 at 21:04

score 4 · Accepted Answer · answered Jun 29 '12 at 20:56


You should not use regular expressions for this task. It's far easier to write a script with BeautifulSoup to process the HTML and to find the element(s) you need.

In your case, you should search for all <a> elements whose href attribute starts with Client- and ends with .txt. That will give you a list of all files.

answered Jun 29 '12 at 20:56

Simeon Visser


I've been avoiding BeautifulSoup because I wanted to use only tools included in the basic Python package. But since regexes aren't up to the task, I guess I'll have to bite the bullet. Thanks! – ZacAttack Jun 29 '12 at 21:13
You can also parse HTML using Python's HTMLParser: http://docs.python.org/library/htmlparser.html . But the code will be longer than using BeautifulSoup (which was made specifically for scraping). – Simeon Visser Jun 29 '12 at 21:14
If you can use external libraries and already know CSS or jQuery selectors, pyquery is the best option. But for this task I would have used just a regex. – gosom Jun 30 '12 at 10:07
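
For anyone who wants to stay inside the standard library, the HTMLParser suggestion from the comments above could look roughly like the sketch below. It is written for Python 3, where the module is html.parser (in Python 2 it was named HTMLParser). The names ClientLinkParser and find_client_links are made up for this illustration, and the "Client-" and ".txt" defaults come from the question.

```python
from html.parser import HTMLParser


class ClientLinkParser(HTMLParser):
    """Collect href values of <a> tags that match a prefix and a suffix.

    Hypothetical helper class for illustration, not part of any library.
    """

    def __init__(self, prefix="Client-", suffix=".txt"):
        super().__init__()
        self.prefix = prefix
        self.suffix = suffix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest; attrs is a list of (name, value).
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if (name == "href" and value
                    and value.startswith(self.prefix)
                    and value.endswith(self.suffix)):
                self.links.append(value)


def find_client_links(html):
    """Return all matching href values found in the given HTML string."""
    parser = ClientLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# The table row from the question, used as sample input.
sample = '''
<tr>
  <td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>
  <td><a href="Client-8.txt">new_Client-8.txt</a></td>
  <td align="right">27-Jun-2012 18:02  </td>
</tr>
'''

print(find_client_links(sample))
```

On the sample row this prints ['Client-8.txt']. The link text new_Client-8.txt is ignored because only the href attribute is inspected. To run it against a real page, pass the decoded HTML you fetched to find_client_links instead of sample.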

score 1 · Answer 2 · answered Jun 29 '12 at 21:05

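from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 is installed

html = '<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="Client-8.txt">new_Client-8.txt</a></td><td align="right">27-Jun-2012 18:02  </td></tr>'
soup = BeautifulSoup(html, 'html.parser')
# href=True skips <a> tags that have no href attribute (avoids a KeyError)
for link in soup.find_all('a', href=True):
    href = link['href']
    # match the known prefix and extension instead of '.txt' anywhere in the value
    if href.startswith('Client-') and href.endswith('.txt'):
        print(href)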
