Getting filenames matching an extension using BeautifulSoup

Question

I am trying to parse an HTML page using BeautifulSoup. The page links to text files whose names end with the .txt extension. I want to parse the HTML and fetch every string that ends with .txt.

All such strings are within an <a href> tag, and here are some examples:

<a href = "foo.txt">

<a href = "bar.txt">

How do I get foo.txt and bar.txt?

I did this:

>>> links = soup.findAll('a')

But I can't find how to extract the complete string... Any suggestions?

score 8 · Accepted Answer · answered May 30 '11 at 10:04

BeautifulSoup accepts regexps as parameters for find() and findAll(). This should work:

import re
links = soup.findAll(href=re.compile(r"\.txt$"))
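
The question also asks how to get the complete string out of each match. A minimal sketch, assuming the bs4 package (where `find_all` is the newer spelling of `findAll`) and a made-up sample page; the `html` content and the `filenames` name are illustrative only:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the page from the question.
html = """
<a href="foo.txt">foo</a>
<a href="bar.txt">bar</a>
<a href="notes.pdf">notes</a>
<a name="top">anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# The compiled regex is searched against each tag's href value.
links = soup.find_all("a", href=re.compile(r"\.txt$"))

# Each result is a Tag; index it like a dict to read the attribute string.
filenames = [link["href"] for link in links]
print(filenames)  # ['foo.txt', 'bar.txt']
```

Tags without an `href` attribute are skipped by the filter, so the `link["href"]` lookup should not raise here.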

vartec


I think it should be : `soup.findAll('a' , href=...` – mouad May 30 '11 at 10:06
Hmm. What is the difference (if any) between what vartec and mouad have suggested? – user225312 May 30 '11 at 10:07
@A A: My suggestion searches only for the `a` tags that have `href="*.txt"`; @vartec's solution checks all tags that have `href="*.txt"`. – mouad May 30 '11 at 10:19
@A: my version in theory would catch any tag with an `href` attr. Thing is, in HTML the only tag with `href` is `<a>` – vartec May 30 '11 at 10:21
@vartec, @mouad: Oh ok! So that is a non issue. One thing I don't understand is that, even in the case of BeautifulSoup, we are using a regular expression. So why not just use it in the first place directly? – user225312 May 30 '11 at 10:22
@A A: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – mouad May 30 '11 at 10:26
@A: because when using Soup, you only apply the regexp to the contents of `href`. Running a regexp over the whole document (w/o Soup) would be extremely complicated and not as efficient – vartec May 30 '11 at 10:26
Aah ok. I get it. Thanks. I just tried it using that and see where I was wrong. – user225312 May 30 '11 at 10:27
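
To make the difference between the two calls discussed in the comments concrete, here is a small sketch, again assuming bs4 and an invented sample page. Note that a few HTML elements besides `<a>` can carry `href` (for example `<link>`, `<area>` and `<base>`), so the unfiltered call can return more than anchors:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one <link> and one <a>, both pointing at .txt files.
html = """
<link rel="alternate" href="feed.txt">
<a href="foo.txt">foo</a>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"\.txt$")

# No tag name given: every tag whose href matches is returned.
any_tag = [tag.name for tag in soup.find_all(href=pattern)]

# Tag name given: only <a> tags are considered.
only_a = [tag.name for tag in soup.find_all("a", href=pattern)]

print(any_tag)  # ['link', 'a']
print(only_a)   # ['a']
```

For a page that only uses `href` on anchors the two calls return the same tags, which is why it made no difference for the page in the question.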
