
I am trying to scrape some pages on a website. Here is an example of the HTML:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>  
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>  

Here is the code I'm working with:

    for x in range(pages):
        pagen += 1
        url3 = url2[:40] + str(pagen) + url2[41:]
        print "url3 = ", url3
        ggs = br.open(url3)
        soup = BeautifulSoup(ggs, "lxml")
        print "soup = ", soup
        trueurl = 'https://WEBSITE.tv'

        # Find the gg links
        download = soup.find_all(href=re.compile("GOODSTUFF"))

        # Download each gg
        for link in download:
            sleep(10)
            gglink = trueurl + link.get('href')
            print gglink
            hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
            req = urllib2.Request(gglink, headers=hdr)
            res_init()
            res = urllib2.urlopen(req)
            directory = "/home/cyber/yen/"  # gg directory, change as you please.
            file += 1
            print "Page", pagen, "of", pageout, ".....", file, 'ggs downloaded'
            urllib.urlretrieve(gglink, directory + 'page' + str(pagen) + '_gg' + str(file) + ".gg")

I only want to download

https://WEBSITE.tv/GG/223197/download/GOODSTUFF

but it also grabs

/feed/rss_ggs_anime/GOODSTUFF

I do not want that.

The problem is that `find_all` matches everything containing GOODSTUFF as a substring. I tried to narrow it down with this:

    for download in soup.find_all(href=re.compile("GOODSTUFF")):
        if download.find("feed"):
            continue

but it doesn't catch anything. I tried `rss` instead of `feed` too, with the same results.
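A likely reason the filter above catches nothing: `download` is a BeautifulSoup tag, so `download.find("feed")` searches for a `<feed>` child element rather than checking the `href` string. Testing the attribute value itself works — a minimal sketch over a fragment of the HTML above (using the stdlib `html.parser` so it needs no lxml):

```python
from bs4 import BeautifulSoup
import re

# Minimal fragment of the page above: one feed link, one download link.
html = '''
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate"/>
<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>
'''

soup = BeautifulSoup(html, "html.parser")

links = []
for tag in soup.find_all(href=re.compile("GOODSTUFF")):
    href = tag.get('href')
    if "feed" in href:  # test the href string, not tag.find("feed")
        continue
    links.append(href)

print(links)  # only the download URL survives
```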

gamer
2 Answers


You can try it like this, if the HTML elements are always like the ones you pasted above:

html="""
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>  
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
for link in soup.select(".download_link a"):
    print(link['href'])
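The `IOError` in the comments below comes from passing a relative href (like `/feed/rss_ggs_all/GOODSTUFF`) straight to `urlretrieve`, which then treats it as a local path. Joining every href against the site root first avoids that — a sketch using `urljoin` (Python 3's `urllib.parse`; in Python 2 the same function lives in `urlparse`):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "https://WEBSITE.tv"

hrefs = [
    "/feed/rss_ggs_all/GOODSTUFF",                      # relative -> joined onto base
    "https://WEBSITE.tv/GG/223197/download/GOODSTUFF",  # absolute -> returned unchanged
]

absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
```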
SIM
  • Tried that and got this Traceback (most recent call last): File "test.py", line 109, in main(url) File "test.py", line 96, in main urllib.urlretrieve(gglink, directory + 'page' + str(pagen) + '_gg' + str(file) + ".gg") File "/usr/lib/python2.7/urllib.py", line 483, in open_local_file raise IOError(e.errno, e.strerror, e.filename) IOError: [Errno 2] No such file or directory: '/feed/rss_ggs_all/GOODSTUFF' – gamer Oct 16 '17 at 05:30
  • Now try that and let me know. Paste the full code with html element in the IDE and then run it. – SIM Oct 16 '17 at 05:48

You just need to modify your regex in this case. When you write `re.compile("GOODSTUFF")`, it matches any href containing GOODSTUFF as a substring.

So I suggest you modify your regex to:

re.compile("http(?:s)://(.*)/GOODSTUFF")

The above regex will give you your desired output, as follows (only the two tags with download links):

[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>, <a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]

Full Snippet:

html = """<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>  
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>"""

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, "lxml")
download_links = soup.find_all(href=re.compile("http(?:s)://(.*)/GOODSTUFF"))
for link in download_links:
    print(link['href'])  # your download code here, e.g. download(link)

Moreover, with regex alone you can get your links directly, without using BeautifulSoup:

download_links = [i[0] for i in re.findall("(http(?:s)://(.*)/GOODSTUFF)", html)]

The result of the above line will be:

['https://WEBSITE.tv/GG/223197/download/GOODSTUFF', 'https://WEBSITE.tv/GG/223197/download/GOODSTUFF']
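Since the page lists the same download link twice, the regex-only approach returns duplicates, and the loop above would fetch the same file twice. Deduplicating while keeping first-seen order is one small addition — a sketch (relying on `dict.fromkeys` preserving insertion order, Python 3.7+):

```python
import re

# Two identical download links, as in the page above.
html = ('<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>\n'
        '<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>')

links = [m[0] for m in re.findall(r'(http(?:s)://(.*)/GOODSTUFF)', html)]
unique = list(dict.fromkeys(links))  # drop duplicates, keep first-seen order
print(unique)
```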
Gahan
  • Now I get this error? Traceback (most recent call last): File "test.py", line 107, in main(url) File "test.py", line 89, in main res = urllib2.urlopen(req) File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open raise URLError(err) urllib2.URLError: – gamer Oct 16 '17 at 05:57
  • The traceback happens while opening the URL, not because of the regex. It depends on what exactly you are passing as the download URL; I can help more if you provide the actual URL. But the trouble asked about in this question is solved. – Gahan Oct 16 '17 at 06:01
  • I accepted your answer because it actually fixed it, but I am getting this SSL error even though SSL is fully updated... any ideas? – gamer Oct 16 '17 at 06:04
  • here you may find various useful answers which might be helpful for your task: https://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python – Gahan Oct 16 '17 at 06:04