
I need a regex that returns the text between double quotes, where that text starts with a specified prefix and ends with a specific file extension (say, .txt). I'm using urllib2 to fetch the page's HTML, which is quite simple.

Basically, if I have something like this:

<tr>
  <td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>
  <td><a href="Client-8.txt">new_Client-8.txt</a></td>
  <td align="right">27-Jun-2012 18:02  </td>
</tr>

it should return just

Client-8.txt

The returned value is the text inside the double quotes. I know that the file name starts with "Client-" and that the file extension is ".txt".

I'm playing around with re.search(regex, string), where the string I pass in is the HTML of the page. But I stink at regular expressions.

Thanks!

Simeon Visser
ZacAttack

2 Answers


You should not use regular expressions for this task. It's far easier to write a script with BeautifulSoup to process the HTML and to find the element(s) you need.

In your case, you should search for all <a> elements whose href attribute starts with Client- and ends with .txt. That will give you a list of all files.
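For readers who want to stay inside the standard library, the filter described above (every <a> element whose href starts with Client- and ends with .txt) can be sketched with Python 3's html.parser module. This is only an illustration, not the BeautifulSoup approach itself: the names LinkCollector, find_links, and SAMPLE_HTML are made up for this example, and under Python 2 (where urllib2 lives) the module is called HTMLParser instead.

```python
from html.parser import HTMLParser

# Sample row taken from the question.
SAMPLE_HTML = (
    '<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>'
    '<td><a href="Client-8.txt">new_Client-8.txt</a></td>'
    '<td align="right">27-Jun-2012 18:02  </td></tr>'
)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags that match."""

    def __init__(self, prefix, extension):
        super().__init__()
        self.prefix = prefix
        self.extension = extension
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if (name == "href" and value
                    and value.startswith(self.prefix)
                    and value.endswith(self.extension)):
                self.matches.append(value)


def find_links(html, prefix="Client-", extension=".txt"):
    """Return every matching href found in the given HTML string."""
    collector = LinkCollector(prefix, extension)
    collector.feed(html)
    collector.close()
    return collector.matches


print(find_links(SAMPLE_HTML))  # ['Client-8.txt']
```

The prefix and extension are ordinary string checks here, so no regular expression is needed once the parser has isolated the attribute value.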

Simeon Visser
  • I've been avoiding BeautifulSoup because I wanted to use only tools included in the basic Python package. But since regexes aren't up to the task, I guess I'll have to bite the bullet. Thanks! – ZacAttack Jun 29 '12 at 21:13
  • You can also parse HTML using Python's HTMLParser: http://docs.python.org/library/htmlparser.html . But the code will be longer than using BeautifulSoup (which was made specifically for scraping). – Simeon Visser Jun 29 '12 at 21:14
  • If you can use external libraries and already know CSS or jQuery selectors, pyquery is the best option. But for this task I would have used just a regex. – gosom Jun 30 '12 at 10:07
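For a page as simple as the one in the question, the regex route mentioned in the last comment might look roughly like the sketch below. It assumes the file name never contains a double quote and that the prefix and extension are literal strings; find_quoted_names and SAMPLE_HTML are hypothetical names used only for this example. It matches any double-quoted string with the right prefix and extension, not just href values, so it is more fragile than a real HTML parser.

```python
import re

# Sample row taken from the question.
SAMPLE_HTML = (
    '<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>'
    '<td><a href="Client-8.txt">new_Client-8.txt</a></td>'
    '<td align="right">27-Jun-2012 18:02  </td></tr>'
)


def find_quoted_names(html, prefix="Client-", extension=".txt"):
    """Return double-quoted strings that start with prefix and end with extension."""
    # re.escape keeps characters such as "." literal; [^"]* stops at the closing quote.
    pattern = '"(' + re.escape(prefix) + '[^"]*' + re.escape(extension) + ')"'
    return re.findall(pattern, html)


print(find_quoted_names(SAMPLE_HTML))  # ['Client-8.txt']
```

Because the capture group sits inside the quotes, re.findall returns the bare file names without the surrounding quote characters.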
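from bs4 import BeautifulSoup  # BeautifulSoup 4 (third-party package: beautifulsoup4)

html = '<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="Client-8.txt">new_Client-8.txt</a></td><td align="right">27-Jun-2012 18:02  </td>'
soup = BeautifulSoup(html, 'html.parser')
# Look only at <a> tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    href = a['href']
    # Keep links that start with "Client-" and end with ".txt"
    if href.startswith('Client-') and href.endswith('.txt'):
        print(href)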
Ashwini Chaudhary