2

I am trying to parse an HTML page using BeautifulSoup. The page links to text files ending with the .txt extension, and I want to parse the HTML and fetch every string that ends with .txt.

All such strings are inside the href attribute of an <a> tag. Here are some examples:

<a href = "foo.txt">

<a href = "bar.txt">

How do I get foo.txt and bar.txt?

I did this:

>>> links = soup.findAll('a')

But I can't find how to extract the complete string... Any suggestions?

user225312

1 Answer

8

BeautifulSoup accepts regular expressions as arguments to find() and findAll(). This should work:

import re
links = soup.findAll(href=re.compile(r"\.txt$"))
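
To get the strings themselves rather than the tag objects, the href value can be read from each match. A minimal sketch, assuming the bs4 package (where find_all is the current spelling of findAll) and a made-up HTML snippet modelled on the question:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the examples in the question.
html = '''
<a href="foo.txt">foo</a>
<a href="bar.txt">bar</a>
<a href="notes.html">notes</a>
'''

soup = BeautifulSoup(html, "html.parser")

# Match any tag whose href ends in .txt, then read the attribute value.
links = soup.find_all(href=re.compile(r"\.txt$"))
txt_files = [link["href"] for link in links]
print(txt_files)
```

Each match behaves like a dictionary of its attributes, so link["href"] gives the complete string.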
vartec
  • I think it should be: `soup.findAll('a', href=...)` – mouad May 30 '11 at 10:06
  • Hmm. What is the difference (if any) between what vartec and mouad have suggested? – user225312 May 30 '11 at 10:07
  • @A A: My suggestion searches only for the `a` tags that have `href="*.txt"`; @vartec's solution checks all tags that have `href="*.txt"`. – mouad May 30 '11 at 10:19
  • @A: my version in theory would catch any tag with an `href` attr. Thing is, in HTML the only tag with `href` is `<a>` – vartec May 30 '11 at 10:21
  • @vartec, @mouad: Oh ok! So that is a non-issue. One thing I don't understand: even with BeautifulSoup we are using a regular expression, so why not just use a regular expression directly in the first place? – user225312 May 30 '11 at 10:22
  • @A A: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – mouad May 30 '11 at 10:26
  • @A: because with Soup, you apply the regexp only to the contents of `href`. Running a regexp over the whole document (without Soup) would be extremely complicated and not as efficient – vartec May 30 '11 at 10:26
  • Aah ok. I get it. Thanks. I just tried it using that and see where I was wrong. – user225312 May 30 '11 at 10:27
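
The difference between the two calls discussed in the comments only shows up when a tag other than `a` carries an `href`. A small sketch, assuming the bs4 package and an invented snippet that contains a `<link>` element:

```python
import re
from bs4 import BeautifulSoup

# Invented markup: a <link> and an <a> that both point at .txt files.
html = '<link href="license.txt"><a href="foo.txt">foo</a>'
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"\.txt$")

# No tag name given: every tag with a matching href is returned.
any_tag = [t["href"] for t in soup.find_all(href=pattern)]

# Tag name given: only <a> tags with a matching href are returned.
only_a = [t["href"] for t in soup.find_all("a", href=pattern)]

print(any_tag)
print(only_a)
```

In both calls the regular expression is applied only to the attribute value, after the parser has already worked out where the tags and attributes are.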